Anonymous

Changes

From SMC Wiki

GSoC/2013/Project ideas

675 bytes added, 04:42, 15 April 2013
SILPA project has a spellchecker written using python with a not so simple algorithm. But still it is not capable of handling inflection and agglutination occurring in Indian languages especially south Indian languages. The dictionary we have for Malayalam spellchecker have about 150000 words. Of course we can expand the dictionary, but that doesn't have much value since words can be formed in Malayalam or Tamil etc by joining multiple words. In addition to that, words get inflected based on grammar forms(sandhi), plural, gender etc. Hunspell has a system to handle this, but so far nobody succeeded in getting it working for multi level suffix stripping as required for Malayalam. Some times a Malayalam word can be formed by more than 5 words joining together. We will need a word splitting logic or a table taking care of all patterns. The project is to attempt solving this with hunspell. If that is not feasible(hunspell upstream is not active), develop an algorithm and implement it.
 
Recently Tamil attempted developing a spellchecker using Hunspell with multi level suffix stripping. You can see the result here https://github.com/thamizha/solthiruthi.
Our attempt should be first to use Hunspell to achieve spellchecking with agglutination and inflection. Probably it will require lot of scripting to generate suffix patterns, we can ask help from existing language communities too. If Hunspell has limitation with multi level suffxes- sometimes Indian languages require more than 5 levels of suffix stripping, we need to document it(bug and documentation) and try to attempt python based solution on top of SILPA framework.
* '''[https://savannah.nongnu.org/task/index.php?12558 Savannah Task]'''
* '''Expertise required''': Basic Average level understanding of grammar system of at least one Indian language* Complexity: Advanced
* '''Mentor''' : Santhosh Thottingal