GSoC/2014/Project ideas
This Page is under development
Ideas for Google Summer of Code 2014
- Please read the FAQ.
Apart from the following ideas, you can propose your own.
If you want to propose an idea, please do so on the project mailing list.
A spell checker for Indic language that understands inflections
Project:
The SILPA project has a spellchecker written in Python that uses a fairly involved algorithm. Even so, it cannot handle the inflection and agglutination found in Indian languages, especially South Indian languages. The dictionary for the Malayalam spellchecker has about 150,000 words. We could expand the dictionary, but that adds little value, because words in Malayalam, Tamil and similar languages can be formed by joining multiple words. In addition, words are inflected according to grammatical rules (sandhi), number, gender and so on. Hunspell has a mechanism for this, but so far nobody has succeeded in getting it to work for the multi-level suffix stripping that Malayalam requires. Sometimes a Malayalam word is formed by joining more than 5 words. We will need either word-splitting logic or a table that covers all the patterns. The project is to attempt to solve this with Hunspell. If that is not feasible (Hunspell upstream is not active), develop an algorithm and implement it.
Recently, the Tamil community attempted to develop a spellchecker using Hunspell with multi-level suffix stripping. See https://github.com/thamizha/solthiruthi for the result. Our first attempt should be to use Hunspell to achieve spellchecking with agglutination and inflection. It will probably require a lot of scripting to generate suffix patterns, and we can ask existing language communities for help. If Hunspell turns out to have limitations with multi-level suffixes (Indian languages sometimes require more than 5 levels of suffix stripping), we need to document them (bug reports and documentation) and attempt a Python-based solution on top of the SILPA framework.
- Savannah Task
- Expertise required: Average-level understanding of the grammar of at least one Indian language
- Complexity: Advanced
- Mentor : Santhosh Thottingal
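The multi-level suffix stripping mentioned in this idea can be sketched roughly as follows. This is only an illustration of the general technique, not SILPA's or Hunspell's algorithm: the function name `is_valid`, the `max_depth` parameter, and the romanised toy roots and suffixes are all assumptions made for the example. It also assumes suffixes attach by plain concatenation, whereas real sandhi rules change characters at the joins.

```python
def is_valid(word, roots, suffixes, max_depth=6):
    """Return True if word is a known root, or a known root followed by
    at most max_depth known suffixes (plain concatenation only)."""
    if word in roots:
        return True
    if max_depth == 0:
        return False
    for suffix in suffixes:
        # Strip one suffix and check the remainder recursively.
        if len(word) > len(suffix) and word.endswith(suffix):
            stem = word[: -len(suffix)]
            if is_valid(stem, roots, suffixes, max_depth - 1):
                return True
    return False


# Toy romanised data, invented for illustration only.
ROOTS = {"kutti", "mara"}
SUFFIXES = {"kal", "il", "um"}

if __name__ == "__main__":
    for candidate in ["kutti", "kuttikalilum", "kuttix"]:
        print(candidate, is_valid(candidate, ROOTS, SUFFIXES))
```

A real solution would also have to undo the sandhi changes at each join and restrict which suffixes may follow which, which is where a pattern table or generated suffix rules would come in.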
Indic rendering support in ConTeXt
Project:
ConTeXt is a TeX macro system similar to LaTeX, but much better suited to design work. The ConTeXt wiki at http://wiki.contextgarden.net/Main_Page has more information. ConTeXt MKII has Indic language rendering support through XeTeX, but MKII is deprecated, and the new MKIV backend does not support Indic rendering yet. The aim of this project is to add Indic rendering support to ConTeXt MKIV. XeTeX uses HarfBuzz for correct Indic rendering.
- Savannah Task
- Expertise required: Understanding of the TeX system, experience with either LaTeX or ConTeXt, and a basic understanding of Indic language rendering. MKIV uses Lua, so familiarity with Lua, the OpenType specification or HarfBuzz will be an added advantage.
- Mentor : Rajeesh K Nambiar
- Complexity : Advanced
- More Details: A partially working patch by Rajeesh for the MKIV Lua code is available. ConTeXt MKII (deprecated) can work with the XeTeX backend for Indic rendering. Here is a sample file:
\usemodule[simplefonts]
\definefontfeature[malayalam][script=mlym]
\setmainfont[Rachana][features=malayalam]
\starttext
മലയാളം \TeX ഉപയോഗിച്ച് ടൈപ്പ്സെറ്റ് ചെയ്തത്
\stoptext
Generate the output using the command:
texexec --xetex <file.tex>
Language model and Acoustic model for Malayalam language for speech recognition system in CMU Sphinx
CMU Sphinx is a large-vocabulary, speaker-independent speech recognition codebase and suite of tools that can be used to develop a speech recognition system for any language. To develop an automatic speech recognition (ASR) system for a language, an acoustic model and a language model have to be built for that language. The acoustic model characterizes how sound changes over time and captures the characteristics of the basic recognition units. The language model describes the likelihood, probability, or penalty assigned when a sequence or collection of words is seen. It attempts to capture the behaviour of the language and to predict which word sequences are likely to occur. Once these two models are developed, they will be useful to everyone doing research in speech processing. ASR systems using the Sphinx engine have already been developed for the Indian languages Hindi, Tamil, Telugu and Marathi. This project aims to develop an acoustic model and a language model for Malayalam.
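To make the idea of a language model concrete, here is a minimal sketch of a bigram model with add-one smoothing, estimated from a tiny text corpus. It is not the CMU Sphinx tooling or its model format. The function names and the romanised toy sentences are assumptions for illustration, and a real model would be trained on a large Malayalam text corpus.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left contexts over whitespace-split sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocabulary = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens[1:])         # everything that can be predicted
        context_counts.update(tokens[:-1])    # everything that can precede a word
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocabulary


def bigram_probability(prev, word, context_counts, bigram_counts, vocab_size):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


if __name__ == "__main__":
    # Toy romanised sentences, invented for illustration only.
    corpus = ["njan veettil pokunnu", "njan schoolil pokunnu"]
    contexts, bigrams, vocab = train_bigram_model(corpus)
    for prev, word in [("<s>", "njan"), ("njan", "veettil"), ("veettil", "njan")]:
        p = bigram_probability(prev, word, contexts, bigrams, len(vocab))
        print(f"P({word} | {prev}) = {p:.3f}")
```

Word pairs seen in the corpus get a higher probability than unseen ones, which is the behaviour the recognizer relies on when choosing between acoustically similar word sequences.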
Mentor: Deepa P. Gopinath
Background Reading
- "Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems", Gopalakrishna Anumanchipalli, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh, R.N.V. Sitaram, S P Kishore
- "Automatic Pronunciation Evaluation And Mispronunciation Detection Using CMUSphinx", Ronanki Srikanth, James Salsman
- "HTK Based Telugu Speech Recognition", P. Vijai Bhaskar (AVNIET, Hyderabad), Prof. Dr. S. Rama Mohan Rao, A. Gopi
- "Design and Development of an Automatic Speech Recognition System for Urdu", Agha Ali Raza, M.Sc. Thesis, FAST‐National University of Computer and Emerging Sciences
- "Investigation Arabic Speech Recognition Using CMU Sphinx System", Hassan Satori, Hussein Hiyassat, Mostafa Harti, and Noureddine Chenfour
- "Understanding the CMU Sphinx Speech Recognition System", Chun-Feng Liao
Language filter for diaspora
Diaspora is a Free Software, federated social networking platform. Diaspora users post in many languages. When people use more than one language in their posts, it is inconvenient for readers who don't understand one of those languages. The task is to tag every post with the languages used in it, ideally detected automatically, but with an option to override the detection. Once each post has a language tag, people should be able to choose their preferred languages, and posts in other languages should be hidden by default. The project should also provide an option to translate posts and comments.
- Expertise required: Ruby on Rails
- Mentor: Pirate Praveen, Ershad K
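As a rough first approximation of automatic tagging, a post can be tagged with the Unicode scripts it contains. The sketch below is in Python for brevity rather than Ruby, and none of it is Diaspora code: the names `detect_scripts` and `is_hidden` and the `min_share` threshold are assumptions for illustration. Note that a script is not a language (Latin covers English and French, Devanagari covers Hindi and Marathi), so a real implementation would need a proper language detector on top.

```python
import unicodedata
from collections import Counter


def detect_scripts(text, min_share=0.2):
    """Return the scripts that make up at least min_share of the letters in text.

    The script is approximated by the first word of each character's
    Unicode name, e.g. "MALAYALAM LETTER MA" gives "MALAYALAM".
    """
    counts = Counter()
    for ch in text:
        # Count letters and combining marks (vowel signs etc.), skip the rest.
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    total = sum(counts.values())
    return sorted(script for script, n in counts.items() if n / total >= min_share)


def is_hidden(post_tags, preferred):
    """Hide a tagged post only if none of its tags is a preferred one."""
    return bool(post_tags) and not (set(post_tags) & set(preferred))


if __name__ == "__main__":
    tags = detect_scripts("hello മലയാളം")
    print(tags, is_hidden(tags, {"LATIN"}))
```

Keeping the detected tags as a plain list on the post would make the manual override simple: the author's choice replaces the detected list.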
Varnam Based
Improvements to the REST API
This includes a rewrite of the current implementation in `golang` and the addition of WebSocket support to improve the input experience. It also includes writing scripts that make it easy to embed input on any webpage. All the changes will go live on [1].
Improve the learning system
The main goal is to improve how varnam tokenizes words while learning them. Today, when a word is learned, varnam takes all of its possible prefixes into account and learns each of them to improve future suggestions. Sometimes this is not enough to produce good suggestions. The suggested improvement is to try to infer the base form of the word being learned.
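The prefix-based learning described above can be pictured with a small sketch. This is not libvarnam's implementation: the function names are hypothetical, and plain characters of a romanised word stand in for varnam's tokens.

```python
from collections import Counter


def prefixes(word):
    """Return every non-empty prefix of word, shortest first."""
    return [word[:end] for end in range(1, len(word) + 1)]


def learn(word, store):
    """Record the word and all of its prefixes in the store."""
    for prefix in prefixes(word):
        store[prefix] += 1


if __name__ == "__main__":
    store = Counter()
    for learned in ["malayalam", "malayali"]:
        learn(learned, store)
    # Prefixes shared by several learned words accumulate higher counts.
    print(store.most_common(5))
```

Inferring a base form would add a step before `learn`, for example stripping known inflectional endings, so that related word forms reinforce the same entry.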
Word corpus synchronization
Create a cross-platform synchronization tool that can upload and download the word corpus from offline IMEs such as varnam-ibus [2]. This will make it easier to build the online word corpus.
[1]: http://www.varnamproject.com
[2]: https://gitorious.org/varnamproject/libvarnam-ibus/
Mentor: Navaneeth K N