GSoC/2013/Project ideas

5 bytes added, 12:55, 1 April 2013
The SILPA project has a spellchecker written in Python with a not-so-simple algorithm. Still, it is not capable of handling the inflection and agglutination occurring in Indian languages, especially South Indian languages. The dictionary we have for the Malayalam spellchecker has about 150,000 words. Of course we can expand the dictionary, but that doesn't have much value, since words in Malayalam, Tamil, etc. can be formed by joining multiple words. In addition, words get inflected based on grammatical forms (sandhi), plural, gender, etc. Hunspell has a system to handle this, but so far nobody has succeeded in getting it working for the multi-level suffix stripping that Malayalam requires. Sometimes a Malayalam word can be formed by joining more than 5 words together. We will need word-splitting logic or a table that takes care of all patterns. The project is to attempt solving this with Hunspell. If that is not feasible (Hunspell upstream is not active), develop an algorithm and implement it.
'''Expertise required''': Basic understanding of the grammar system of at least one Indian language
'''Mentor''': Santhosh Thottingal
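To make the two ideas in the description concrete (multi-level suffix stripping and splitting a joined word into dictionary words), here is a minimal Python sketch. It is not SILPA's code and not Hunspell's affix mechanism. The names <code>STEMS</code>, <code>SUFFIX_RULES</code>, <code>MAX_SUFFIX_DEPTH</code>, <code>strip_suffixes</code>, <code>split_compound</code> and <code>is_valid</code> are hypothetical, and the word list is toy English data used only as a placeholder. Real sandhi rules change characters at the join, which this sketch does not model.

```python
from functools import lru_cache

# Toy data, purely illustrative: NOT real Malayalam morphology.
STEMS = frozenset({"book", "shelf", "maker", "story"})

# Each rule is (suffix, replacement): stripping the suffix and appending
# the replacement should give either a stem or a shorter inflected form.
SUFFIX_RULES = (("ies", "y"), ("s", ""), ("less", ""), ("ness", ""))

# Upper bound on how many suffixes may be stacked on one stem.
MAX_SUFFIX_DEPTH = 3


def strip_suffixes(word, depth=MAX_SUFFIX_DEPTH):
    """Return True if `word` reduces to a stem by stripping at most `depth` suffixes."""
    if word in STEMS:
        return True
    if depth == 0:
        return False
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: -len(suffix)] + replacement
            if strip_suffixes(candidate, depth - 1):
                return True
    return False


@lru_cache(maxsize=None)
def split_compound(word):
    """Split `word` into parts that are each a stem or an inflected stem.

    Returns a tuple of parts, or None when no complete split exists.
    Longer leading parts are tried first, with backtracking on failure.
    """
    if not word:
        return ()
    for end in range(len(word), 0, -1):
        head = word[:end]
        if strip_suffixes(head):
            rest = split_compound(word[end:])
            if rest is not None:
                return (head,) + rest
    return None


def is_valid(word):
    """Accept a non-empty word if it can be fully split into known parts."""
    return bool(word) and split_compound(word) is not None
```

With this toy data, <code>split_compound("bookshelfmakers")</code> backtracks past the tempting first part <code>books</code> and returns <code>("book", "shelf", "makers")</code>, while <code>is_valid("booklessness")</code> succeeds by stripping two stacked suffixes. A table-driven design like this keeps the language-specific knowledge in data rather than in code, which is one possible reading of "a table that takes care of all patterns".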