User:Tejas godambe

From SMC Wiki

Organization: Swathanthra Malayalam Computing

Short description: The goal is to prepare a text corpus, collect speech data from many native Malayalam speakers, and build a strong baseline speaker-independent ASR system which lays a solid foundation for further R&D in ASR and other spoken language technologies for Malayalam. Sphinx and Kaldi (with DNN and SGMM) can help us build the ASR system quickly. We can also embed a speech recording app on the SMC website or project blog to create a regular in-flow of real-world data that keeps the ASR system updated.

Email Address: tejas.godambe@gmail.com

Telephone: 09821518774

Freenode IRC Nick: tejasg

Your university and current education: Pursuing an MS by Research on the topic "Devising novel confidence measures for cross-lingual speech recognition" with Dr. Kishore Prahallad at the International Institute of Information Technology, Hyderabad, India.


Why do you want to work with Swathanthra Malayalam Computing?

I wish to work with SMC for the following reasons:

1) SMC is a volunteering organization with the end goal of putting Indian languages on the map of graphical and spoken interfaces. This endeavor has several positive outcomes:

   a. Middle-to-old-aged people (including my parents) who are scared of operating
      computers will be able to access and use them confidently and efficiently.
      Technology will not remain "only-of-youth".
   b. Language barriers will weaken. A vast majority of the population will be able
      to reap the benefits of technology.
   c. It will encourage people to speak and write in their respective mother tongues.
      If this initiative scales up with time, it will, in the long run, help prevent
      the endangerment of regional languages and help preserve the cultural heritage
      of India.

2) I support the free and open source software initiative, but have not been involved with any project yet. Because of my past experience in building ASR systems, I feel confident about facing the challenges encountered while building an efficient ASR system in Malayalam for SMC, and it would be a great opportunity for me to engage, interact, and learn the ethics and etiquette practiced in the open source community.


Do you have any past involvement with Swathanthra Malayalam Computing or another open source project as a contributor?

No.


Did you participate in past GSoC programs? If so, which years and which organizations?

No.


Do you have other obligations between May and August? Please note that we expect the Summer of Code to be a full-time, 40-hour-a-week commitment.

No, I don't have any obligations between May and August. My advisor has permitted me to work full-time for GSoC '14.


Will you continue contributing to/supporting Swathanthra Malayalam Computing after the GSoC 2014 program? If yes, which area(s) are you interested in?

Yes. I would like to remain engaged with the SMC team after the GSoC project. My current inclination is toward ASR and text-to-speech system development and research.


Why should we choose you over other applicants?

1) I humbly believe that I will be able to face the challenges of building an ASR system for Malayalam and quickly devise solutions whenever something fails. This confidence is a product of my past experience of two years at TIFR, Mumbai and a year at IIIT-H, where I have been building and optimizing Marathi and Telugu ASR systems and performing cross-lingual experiments using Sphinx. It was in this process, after building and debugging ASR systems several times, that I became aware of the many nitty-gritty details of Sphinx. In the beginning, when I was new to Sphinx, I was overwhelmed by its vastness, but with time I began to understand its internal workings. The links http://www.speech.cs.cmu.edu/sphinxman/ and http://asr.cs.cmu.edu/ and the speech recognition forum at SourceForge helped me a great deal in understanding many intricacies of Sphinx and ASR in general.


2) Even though Sphinx builds statistical (data-driven) acoustic models, which lets even a researcher new to speech build an ASR system in hours, many things come only with experience and a good knowledge of the theory: debugging the actual error when error messages do not directly point at its source, planning experiments and making decisions properly, having a feel for what may work and what may not, and judging how much time a task may take and whether it is worth the effort. I feel that my experience at TIFR, Mumbai, the speech processing courses credited at IIIT-H, the things learnt by tinkering and by asking questions on the CMU Sphinx forum at SourceForge, the discussions with my fellow speech lab members working on different aspects of speech processing, and above all my liking for speech technology and Human-Computer Interaction give me an edge.


An overview of your proposal

The goal is to build a strong baseline speaker-independent Automatic Speech Recognition (ASR) system for Malayalam. Since free multi-speaker speech data is not available for Malayalam, it may be better to create such data at this juncture. The 1000 specially designed Malayalam sentences (from the Indic database), Malayalam quotes from the web, text from Malayalam story books (if available on the web), and sentences from different sections of newspapers, such as Business, Health, Education, and Entertainment, which inherently carry diverse topics, themes, vocabulary, and writing styles, shall be selected as transcriptions and shall also be used to create an N-gram Language Model (LM). The text from the above diverse sources shall be compiled into unique sets of 10 sentences each, and native Malayalam speakers (students and staff at IIIT-H of different age groups and genders) shall be asked to read out one or more sentence sets in a silent environment (a recording studio at IIIT-H) to create 2-3 hours of speech data (similar to TIMIT) and build a strong baseline system upon which further R&D in ASR for Malayalam can be carried out in the future. We will use an automatically generated pronunciation dictionary to start with, and refine it later. Then, experiments shall be conducted to optimize typically tuned training parameters, such as the number of components in the Gaussian mixture and the number of tied states, and decoding parameters, such as the Word Insertion Penalty and Language Weight, to obtain the best system accuracy. Confidence measures based upon acoustic features, computed with an Artificial Neural Network model as reported by my colleagues in [4], can be incorporated as a post-processing module to select the best recognition hypotheses and boost the accuracy even further. We can also try adapting the Marathi, Telugu, and Tamil acoustic models [3] (each trained on 20 hours of speech data) available with us to the two to three hours of clean Malayalam speech to see whether accuracy improves.

The new Kaldi speech recognition toolkit includes scripts for state-of-the-art technologies such as Subspace Gaussian Mixture Models and Deep Neural Networks. We obtained outstanding results [5] for TIMIT, and if mentor permits, we can try it for Malayalam speech data as well.

In order that the ASR system for Malayalam doesn't become stagnant after GSoC and is kept updated with real-world data, we can build a small speech recorder and embed it on the SMC website or the project blog. We shall appeal to visitors of the SMC website who are fluent in Malayalam to donate speech. Each user coming forward to donate speech shall be shown 10 moderately long sentences and asked to read them one by one. This way, similar to Google, which stores and learns from each user query, we shall be able to update our ASR system on a regular basis as more data comes in. Since these recorded sentences are likely to contain speech disfluencies and non-speech events, we shall get a good amount of real-world data in this manner. The recorded sentences will then be filtered to remove casually spoken, highly noisy utterances and those containing minimal speech, to prevent the baseline acoustic models from contamination.

The automatically generated dictionary shall be refined using feedback from acoustic features, confidence scores, and confused word pairs. A new ASR system using the refined pronunciation dictionary shall then be developed and optimized.


How you intend to implement your proposal, and a rough timeline for your progress with phases

The action plan, along with the timeline, is as follows:

1) Make transcripts, phoneme set and pronunciation dictionary ready:

   Milestone 1: Before May 21.

a. Transcriptions: We shall use the 1000 Malayalam sentences from the Indic database. The sentences were selected to cover the 5000 most frequent words in a Malayalam text corpus. We shall also gather text from different sections of news archives, such as Entertainment, World, Sports, Business, Health, and Science and Technology, as well as Malayalam quotes and text from Malayalam stories on the web, to include diversity. This text will be compiled into unique sets of 10 sentences each [2] before being handed over to readers for recording. The UTF-8 text can be used for recording, and will be converted to IT3 (using a Perl script available with us) for training in Sphinx. The text in IT3 format will also be used for preparing the N-gram LM.
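
To make the chunking step concrete, here is a minimal sketch in Python, assuming the pooled UTF-8 sentences sit one per line in a hypothetical malayalam_sentences.txt; deduplication keeps the sets unique, while the IT3 conversion itself remains with the existing Perl script:

    # Compile pooled UTF-8 sentences into unique, non-overlapping sets of 10.
    import codecs

    SET_SIZE = 10
    with codecs.open('malayalam_sentences.txt', 'r', 'utf-8') as f:
        sentences = [line.strip() for line in f if line.strip()]

    seen = set()  # deduplicate while preserving order
    unique = [s for s in sentences if not (s in seen or seen.add(s))]

    for i in range(0, len(unique) - SET_SIZE + 1, SET_SIZE):
        name = 'set_%03d.txt' % (i // SET_SIZE + 1)
        with codecs.open(name, 'w', 'utf-8') as g:
            g.write('\n'.join(unique[i:i + SET_SIZE]) + '\n')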

b. Phoneme set: We shall start by treating each phoneme in Malayalam as a separate class. Later, if required, we may have to merge a few of the least frequent phonemes into acoustically similar, frequently occurring phonemes so that the training process in Sphinx completes smoothly; Sphinx throws a "senone never occurred" error message whenever the instances of a phoneme in the database are fewer than the number of mixture components in the Gaussian Mixture Model chosen for training.
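
As a quick diagnostic for this, the sketch below counts phoneme frequencies in the phone-level transcripts (file format assumed: one space-separated phone string per line) and flags phonemes whose counts fall below the chosen number of mixture components:

    # Flag phonemes too rare to train the chosen number of mixtures.
    from collections import Counter

    N_MIX = 8  # Gaussian mixture components per state chosen for training
    counts = Counter()
    with open('transcripts.phones') as f:
        for line in f:
            counts.update(line.split())

    for phone, n in sorted(counts.items(), key=lambda kv: kv[1]):
        if n < N_MIX:
            print('%s occurs only %d times: candidate for merging' % (phone, n))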

c. Pronunciation dictionary: Pronunciation dictionary preparation is as time-consuming and tedious a job as speech data collection and transcription preparation. It requires a language expert, and the job is not simply to enter the canonical pronunciation of each word, but rather to enter the phoneme sequences of words as they are pronounced in the database. So, we shall use a parser script written in Perl by my colleagues at IIIT-H while building a TTS system with the Indic database. The dictionary generated by the parser script will not be 100 percent accurate, but we shall refine its pronunciations over time using feedback from the ASR system's errors, confused word pairs, acoustic scores, etc.
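
Since IT3 is close to a phonetic notation, a first-cut dictionary can be sketched by mapping each IT3 character of a word to a phone, as below. This is only a stand-in for the actual Perl parser, which also handles multi-character phones, geminates, and similar details:

    # Build a naive first-cut pronunciation dictionary from IT3 transcripts.
    words = set()
    with open('transcripts.it3') as f:
        for line in f:
            words.update(line.split())

    with open('malayalam.dic', 'w') as g:
        for w in sorted(words):
            g.write('%s\t%s\n' % (w, ' '.join(w)))  # one phone per character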


2) Record speech data

   Milestone 2 :  May 21 - June 10

Since a free multi-speaker speech database is not available for Malayalam, it may be better to create one at this juncture. Using the 1000 specially designed sentences (prepared during the Indic database creation) and text from news archives, we can plan to record more than two hours of speech data (just like TIMIT) to build a strong baseline system, which will help us lay a strong foundation for further research and development on Malayalam ASR in the future.


Recording shall be done inside a sound-proof recording studio in the IIIT-H building by asking 15-20 native Keralite speakers of different age groups and genders (students and staff of IIIT-H), hailing from different places in Kerala, to read the sentences. A standard headset microphone connected to a Zoom handy recorder shall be used for recording. The reason for using a handy recorder is that it is highly mobile and easy to operate. By using a headset, the distance from the microphone to the mouth and the recording level can be kept constant. Speech recorded using the Zoom recorder will be sampled at 16 kHz, which we can later down-sample to 8 kHz if the acoustic models are to be used in a telephony voice interface.
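
If down-sampling is needed later, a one-off sketch using SciPy (assuming a 16 kHz mono WAV input) could look like the following; sox or any equivalent tool would do the same job:

    # Down-sample a 16 kHz recording to 8 kHz with anti-alias filtering.
    from scipy.io import wavfile
    from scipy.signal import decimate

    rate, x = wavfile.read('utt_16k.wav')  # expects rate == 16000
    y = decimate(x.astype(float), 2)       # 2:1 decimation with filtering
    wavfile.write('utt_8k.wav', rate // 2, y.astype(x.dtype))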


TIMIT is used as a benchmark dataset in the speech recognition community, and it enables one to build a strong baseline system for English. A major drawback of our collected data in comparison to TIMIT will be that, in the time span available, I will be able to acquire speech data from only 15-20 native speakers. We could outsource this task to Amazon Mechanical Turk, but we may not obtain the quality I can assure when the recording is done in my presence inside the recording studio. Still, I feel this data will be a great starting point toward building a full-fledged and more robust ASR system in the future. We have acoustic models trained on around 20 hours each of Tamil, Telugu, and Marathi data. We can try adapting these acoustic models to the two hours of clean Malayalam speech data and see whether accuracy improves.


3) Build and optimize ASR system parameters:

  Milestone 3: June 11 - June 14

Develop a baseline ASR system by running the RunAll.pl script provided by Sphinx. RunAll.pl (as described in detail on page 3 of [4]) executes the following steps:

1. Extracts features

2. Trains Context Independent (CI) HMMs

3. Force-aligns transcripts using CI HMMs

4. Trains Context Dependent (CD) untied HMMs

5. Trains CD tied-state HMMs

6. Force-aligns transcripts using CD models

7. Trains the Language Model (using the CMU-Cambridge Statistical LM Toolkit)

8. Decodes test utterances

9. Evaluates the decoded output

Then, we will adjust and optimize the typically tuned parameters, such as the number of tied states and the number of components in the Gaussian mixture model, and decoder parameters, such as the Word Insertion Penalty (WIP) and Language Weight, to achieve the best accuracy.
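
A minimal sketch of such a tuning loop follows; decode_and_score() is a hypothetical stand-in for running the Sphinx batch decoder (whose -lw and -wip flags control these two parameters) and scoring word accuracy on a development set:

    # Grid search over language weight (lw) and word insertion penalty (wip).
    def decode_and_score(lw, wip):
        # Stand-in: in practice, run the batch decoder with these settings
        # and return the measured word accuracy on the development set.
        return -(lw - 10) ** 2 - (wip - 0.7) ** 2  # dummy surface with a peak

    best_params, best_acc = None, float('-inf')
    for lw in [6, 8, 10, 12, 14]:
        for wip in [0.2, 0.5, 0.7, 1.0]:
            acc = decode_and_score(lw, wip)
            if acc > best_acc:
                best_params, best_acc = (lw, wip), acc
    print('best (lw, wip):', best_params)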


4) Build an ANN model in the post-processing stage for outputting high-confidence hypotheses

    Milestone 4: June 15 - June 30

Confidence measures based upon acoustic features, computed using an Artificial Neural Network model [4], can be incorporated as a post-processing module to select the best recognition hypotheses and boost the accuracy even further.
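
As a toy illustration of the idea (not the exact model of [4]), the sketch below trains a tiny one-hidden-layer network on synthetic per-word features, standing in for quantities such as a normalized acoustic score and word duration, to output an accept/reject probability; the real features would come from the decoder output:

    # Toy ANN confidence classifier trained on synthetic word-level features.
    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(200, 3)                            # 200 words x 3 features
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # synthetic accept labels

    W1 = 0.1 * rng.randn(3, 8); b1 = np.zeros(8)     # one hidden layer, 8 units
    W2 = 0.1 * rng.randn(8); b2 = 0.0
    lr = 1.0

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(2000):                            # plain batch gradient descent
        H = np.tanh(X.dot(W1) + b1)
        p = sigmoid(H.dot(W2) + b2)                  # P(word is correct)
        g = (p - y) / len(y)                         # d(cross-entropy)/d(logit)
        gW2, gb2 = H.T.dot(g), g.sum()
        gH = np.outer(g, W2) * (1.0 - H ** 2)
        gW1, gb1 = X.T.dot(gH), gH.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2

    accepted = p > 0.5                               # keep high-confidence words

Words falling below the threshold can be flagged for rejection or for a second decoding pass.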


5) Mid-term evaluations:

   June 23 - June 27

Complete the mid-term evaluations in this period. Also, utilize this period to finish minor pending work and fix minor bugs.


6) Try SGMM and DNN techniques on Malayalam data using Kaldi toolkit

   Milestone 5: July 1 - July 10

Kaldi is a much newer speech recognition toolkit, but it has become popular because of its improved algorithms and support for state-of-the-art technologies such as Subspace Gaussian Mixture Models and Deep Neural Networks. I recently trained an ASR system on the TIMIT database using Kaldi, and it gave outstanding results. The KaldiResutsOnTIMIT.txt file is at [5].

We can try and see how SGMMs and DNNs fare on our 2-3 hours of Malayalam data.


7) Embed speech recording application on SMC website or blog

   Milestone 6: July 11 - July 25

By then, we will have a functional and optimized ASR system. We would like to keep extending its capabilities by making it robust to background noise, more speaker-independent, and, in the long-term future, more intelligent by appending a Natural Language Understanding module. For that, we need to create a means through which more and more real-world data can be acquired and fed to the trainer, just like Google, which stores all our queries and keeps learning from them.

The idea is that we can embed a small speech recorder app on the SMC website or on our project blog and appeal to visitors who are fluent in Malayalam to donate speech. This app will show the visitor 10 moderately long sentences (no more than 8-10 words per sentence, to prevent tiring the reader) and ask him/her to read them one by one in response to the system's prompts. This way we will be able to collect speech from a large number of native Malayalam speakers. The recorded speech is expected to contain speech disfluencies and non-speech events, and hence shall constitute a good representation of real-world data. We will monitor the sentences uploaded for reading to ensure that each sentence is light in mood and contains nothing displeasing. We will also give readers the option to re-record speech with another set of 10 sentences if they are interested. All recorded utterances will be rated by force-aligning them with the sentences shown to the user. Sphinx's force-align tool will give low likelihoods (ratings) to highly noisy, empty, and casually spoken utterances. All such utterances can then be filtered out to prevent the baseline acoustic models from contamination.
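
A minimal filtering sketch, assuming the aligner's per-utterance scores have been dumped into a hypothetical align_scores.txt with one "utt_id loglik_per_frame" pair per line:

    # Keep only utterances whose alignment likelihood clears a threshold.
    THRESHOLD = -8.0  # illustrative value; tune on a hand-checked sample

    kept = []
    with open('align_scores.txt') as f:
        for line in f:
            utt, score = line.split()
            if float(score) >= THRESHOLD:
                kept.append(utt)

    with open('train_list.txt', 'w') as g:
        g.write('\n'.join(kept) + '\n')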


If the appeal to donate speech does not receive expected response, we shall think of building an app which will automatically solicit sentences from the visitors.


8) Add acoustic models of filler sounds to baseline acoustic models

   Milestone 7: July 26 - August 3

The utterances collected through the web app will contain many non-speech sounds. Feeding these utterances to the Sphinx trainer along with the sentences shown to the reader (which contain no non-speech labels to account for those sounds) may damage the acoustic models of the phones adjacent to non-speech sounds. The sentences should therefore be modified to accommodate non-speech labels at the proper locations in the transcript before feeding them to the Sphinx trainer. We have acoustic models for a variety of filler sounds as a part of the Marathi acoustic models. Since acoustic models are data files rather than text, we may have to understand the data structure of Sphinx acoustic models and, if necessary, slightly edit the C code and recompile Sphinx to export the filler models from the Marathi acoustic models into the Malayalam baseline models.

Once this is accomplished, force-aligning the utterances with the sentences will insert filler labels at the proper places in the transcript. Such transcripts will then be ready to be used for training.
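
A small sketch of this regrouping step, assuming a hypothetical aligner dump aligned.txt with one "utt_id token" pair per line, where the tokens include hypothesized filler labels such as ++BREATH++; the output follows the SphinxTrain transcript convention:

    # Regroup aligned tokens into SphinxTrain-style transcript lines.
    from collections import OrderedDict

    utts = OrderedDict()
    with open('aligned.txt') as f:
        for line in f:
            utt, token = line.split()
            utts.setdefault(utt, []).append(token)

    with open('train.transcription', 'w') as g:
        for utt, tokens in utts.items():
            g.write('<s> %s </s> (%s)\n' % (' '.join(tokens), utt))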


9) Refine the auto-generated pronunciation dictionary and rebuild the system

   Milestone 8: August 4 - August 6

The feedback obtained from the training/decoding log files and from monitoring the confused word pairs, deleted/inserted words, word lattices, and acoustic and confidence scores will help us weed out many of the errors introduced into the pronunciation dictionary by the dictionary-generating Perl script. We will then rebuild a more consistent system to obtain better performance.
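
To mine the confused word pairs systematically, the reference and hypothesis transcripts can be aligned with plain edit distance and the substitutions tallied; the sketch below assumes hypothetical ref.txt and hyp.txt files with one "utt_id word word ..." line each:

    # Tally substitution pairs from edit-distance alignment of ref vs. hyp.
    from collections import Counter

    def sub_pairs(ref, hyp):
        n, m = len(ref), len(hyp)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1): d[i][0] = i
        for j in range(m + 1): d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
        pairs, i, j = [], n, m  # backtrack, collecting substitutions
        while i > 0 and j > 0:
            if d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
                if ref[i - 1] != hyp[j - 1]:
                    pairs.append((ref[i - 1], hyp[j - 1]))
                i, j = i - 1, j - 1
            elif d[i][j] == d[i - 1][j] + 1:
                i -= 1
            else:
                j -= 1
        return pairs

    refs = dict(line.split(None, 1) for line in open('ref.txt'))
    hyps = dict(line.split(None, 1) for line in open('hyp.txt'))
    counts = Counter()
    for utt in refs:
        if utt in hyps:
            counts.update(sub_pairs(refs[utt].split(), hyps[utt].split()))
    for (r, h), c in counts.most_common(20):
        print('%s -> %s: %d' % (r, h, c))

Pairs that recur across many utterances usually point at a wrong or missing pronunciation variant rather than an acoustic problem.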


10) Allotted time for unexpected delays and pending work

     August 7- August 10


11) Pencils down period starts

     August 11- August 18
     Take time to scrub code, write tests and improve documentation.


The need you believe it fulfills

Speech is the most natural form of communication, but spoken language technology is still relatively immature after six decades of research. Today, ASR and Text-to-Speech systems are being developed for English and many European languages, and they work really well within the limits of their task. These highly sophisticated and finely tuned systems yielding high accuracies are mostly developed by commercial organizations which charge large amounts for building spoken interfaces. As a result, technology penetration is minimal, less attention is being paid to this field by engineers and scientists, and fewer speech recognition systems are being developed for Indian languages today. It is through this medium and the honest initiatives of FOSS organizations like SMC that we can grow the community interested in developing and researching ASR systems and spoken language interfaces, and extend this effort to as many Indian languages as possible.


This project is a stepping stone toward a larger, long-term goal. If, with time, we are able to build a robust ASR system for Malayalam, it could be integrated into an OS to enable hands-free computing, and could be used to develop good Interactive Voice Response (IVR) systems, machine translation systems, robot systems, etc.


Any relevant experience you have

Worked as a Research Fellow with Dr. Samudravijaya K at the Tata Institute of Fundamental Research, Mumbai from September 2010 to August 2012 on the project "Speech-based Access for Agricultural Commodity Prices in Six Indian Languages (DIT) 2010-2012". During that time we developed a voice interface to the Agmarknet website (http://agmarknet.nic.in/) in Marathi for Maharashtra's farmers, so as to overcome hurdles such as the unavailability of computers and internet in villages, and farmers' lack of familiarity with operating computers, the internet, and the English language.

Among the many tasks in the project, I led a few while participating in and helping with the others. The whole project included designing the data collection process [1][2] for collecting speech data from 1500 farmers across 34 districts of Maharashtra, building a data collection system over the telephone channel using a Computer Telephony Interface (CTI) card and the Asterisk software, designing and creating software for data validation, building and optimizing the parameters of a Marathi ASR system [3], porting the ASR system to the telephone channel for online evaluation, and employing confidence measures using an Artificial Neural Network model to improve accuracy and consequently minimize transaction time and maximize the transaction rate [4]. We also tried building a keyword spotting system to spot commodity names (the keywords of interest) in a stream of unconstrained input speech.

The voice interfaces for the six Indian languages can be accessed by dialling the numbers provided at http://asrmandi.wix.com/asrmandi


Any other details you feel we should consider

1) Fluent with the CMU Sphinx and Kaldi toolkits, and with scripting/programming languages such as csh, Bash, Perl, C, and MATLAB.

2) Semi-fluent with C++ and Python.

3) Current CGPA at IIIT-H: 8.83.


Tell us about something you have created

As mentioned above, my team and I created a voice interface to http://agmarknet.nic.in/ in Marathi for Maharashtra's farmers.

Apart from this, as part of the coursework at IIIT-H, we were required to develop some fun systems, listed below:

1) Spoken keyword detection using Segmental Dynamic Time Warping (SDTW)

   https://drive.google.com/file/d/0B248SZQblK2nVWY0WGpvYzBjRlE/edit?usp=sharing

2) Text normalization and morphophonemic processing for Marathi TTS system.

   https://drive.google.com/file/d/0B248SZQblK2nM0x1Sko3V0NOa3M/edit?usp=sharing

3) Image compression using the Principal Component Analysis technique.


Have you communicated with a potential mentor? If so, who?

I haven't communicated with Deepa Gopinath Madam as yet, but I have subscribed to the student mailing list, and it has helped a great deal in gathering ideas and understanding the needs of the project.


SMC Wiki link of your proposal

http://wiki.smc.org.in/User:Tejas_godambe


References

[1] Tejas Godambe and Samudravijaya K, "Speech Data Acquisition for Voice based Agricultural Information Retrieval", presented at the 39th All India DLA Conference, Punjabi University, Patiala, 14-16 June 2011.

[2] Tejas Godambe, Nandini Bondale, Samudravijaya K and Preeti Rao, "Multi-speaker, Narrowband, Continuous Marathi Speech Database", Proc. of Oriental COCOSDA 2013, 25-27 November 2013, New Delhi.

[3] Tejas Godambe, Namrata Karkera and Samudravijaya K, "Adaptation of Acoustic Models for Improved Marathi Speech Recognition", Proc. of the International Conference ACOUSTICS 2013, New Delhi, 10-15 November 2013.

[4] Jigar Gada, Preeti Rao and Samudravijaya K, "Confidence Measures for Detecting Speech Recognition Errors", Proc. of the National Conference on Communications, 15-17 February 2013, New Delhi.

[5] Results of SGMM and DNN on TIMIT (using the Kaldi toolkit): https://drive.google.com/file/d/0B248SZQblK2nZVJDZ0p5ME1RS00/edit?usp=sharing