Malayalam Spell-checker

Problem

English spell-checking dictionaries "rely on complete lists of full word forms, a requirement that cannot be met for morphologically complex languages" like Malayalam. In theory, Malayalam can agglutinate an unlimited number of words into a single word; in practice, a compound usually contains fewer than 10. Handling these agglutinations and inflections in a spell-checker is challenging.

See http://thottingal.in/documents/MalayalamComputingChallenges.pdf

Other Challenges

Homophonic root words can have different inflections:
മറക്കുക (to forget) & മറയുക (to be hidden); പറയുക (to say) & പറക്കുക (to fly)
The same word can inflect differently in the same context (not common):
പോവുക, പോകുക (both "to go")
Sandhi rules are complex.

Possible solutions

Hunspell

Hunspell has an algorithm for handling agglutination. How to use it for Malayalam still needs to be worked out.

Implementation in other languages

Spell Checking an Agglutinative Language: Quechua
http://www.zora.uzh.ch/52921/1/ltc-106-rios.pdf
Quechua does not seem to have the complexity of Malayalam sandhi, and the automaton presented in the paper does not seem to work for Malayalam. An example of a Quechua word split into its morphemes:

kachichasqa = kachi + cha + sqa
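
The split in the example above can be sketched as plain concatenation: one root from a lexicon followed by zero or more suffixes, with no change to the letters at the joins. The following is a minimal Python sketch, not the automaton from the paper; the `ROOTS` and `SUFFIXES` sets are hypothetical and contain only the morphemes from the example.

```python
# Hypothetical toy lexicon: only the morphemes from the example above.
ROOTS = {"kachi"}
SUFFIXES = {"cha", "sqa"}


def segment(word):
    """Return every split of `word` into one root plus zero or more suffixes."""
    results = []

    def split_suffixes(rest, parts):
        # Nothing left to consume: the whole word has been covered.
        if not rest:
            results.append(parts)
            return
        # Try every prefix of the remainder that is a known suffix.
        for i in range(1, len(rest) + 1):
            if rest[:i] in SUFFIXES:
                split_suffixes(rest[i:], parts + [rest[:i]])

    # Try every prefix of the word that is a known root.
    for i in range(1, len(word) + 1):
        if word[:i] in ROOTS:
            split_suffixes(word[i:], [word[:i]])
    return results


print(segment("kachichasqa"))  # one segmentation: kachi + cha + sqa
print(segment("kachichasq"))   # no segmentation: the last piece is not a suffix
```

A word is accepted only if at least one segmentation exists. Because the pieces are matched letter for letter, this approach assumes the morphemes stay unchanged when joined, which is where Malayalam sandhi is likely to break it.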

http://www.cmpe.boun.edu.tr/~akin/papers/spelling_checking_in_Turkish.pdf

http://arxiv.org/pdf/cmp-lg/9410004.pdf

Stemmer: For finding root words

http://www.ldcil.org/up/conferences/morph/presentations/Vijay%20[Compatibility%20Mode].pdf

http://www.cse.iitb.ac.in/~pb/papers/cicling12-stemming.pdf

Lttoolbox

Lttoolbox, from the Apertium package, can be used to tokenize and lemmatize compounds, agglutinations and inflections.

<source lang="xml">
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="noun"/>
  </sdefs>
  <pardefs>
    <!-- Only the noun symbol seen in the output is shown inside <r>; any other symbol tags are not known. -->
    <pardef n="poyi_v">
      <e><l/><r></r></e>
      <e><l>yi</l><r></r></e>
      <e><l>ya</l><r></r></e>
    </pardef>
    <pardef n="athu_n">
      <e><l>athu</l><r>athu<s n="noun"/></r></e>
      <e><l>athu</l><r>athu<s n="noun"/></r></e>
      <e><l>thu</l><r>athu<s n="noun"/></r></e>
    </pardef>
    <pardef n="kond">
      <e><l/><r><s n="noun"/></r></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="povuka"><i>po</i><par n="poyi_v"/></e>
    <e lm="athu"><par n="athu_n"/></e>
    <e lm="kond"><i>kond</i><par n="kond"/></e>
  </section>
</dictionary>
</source>

The code above returned this result:
Input: poyathukond poyaathukond poyathu athukond thu thukond
Output: ^poyathukond/po+athu<noun>+kond<noun>$ ^poyaathukond/po+athu<noun>+kond<noun>$ ^poyathu/po+athu<noun>/po+athu<noun>$ ^athukond/athu<noun>+kond<noun>$ ^thu/athu<noun>$ ^thukond/athu<noun>+kond<noun>$
As the output shows, the program also resolves misspelled compounds such as poyaathukond and thukond. The current lttoolbox markup does not offer finer control over this. To check the spelling, the compound needs to be regenerated from the stems and matched against the input.
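
The regenerate-and-match idea can be sketched as follows. This is a minimal Python sketch under stated assumptions, not a working Malayalam generator: the `FORMS` table is a hypothetical list of surface forms for the three stems in the toy dictionary above, and `join` implements a single illustrative sandhi rule (a + a becomes a) on transliterated text. The helper names `stems_of`, `regenerate` and `is_correct` are made up for this example.

```python
import re
from itertools import product

# Hypothetical surface forms per stem, based on the toy dictionary above.
FORMS = {
    "po": ["po", "poyi", "poya"],
    "athu": ["athu"],
    "kond": ["kond"],
}


def stems_of(analysis):
    """Extract stems from an analysis such as 'po+athu<noun>+kond<noun>'."""
    return re.sub(r"<[^>]+>", "", analysis).split("+")


def join(left, right):
    """Toy sandhi rule (assumption): a + a collapses into a single a."""
    if left.endswith("a") and right.startswith("a"):
        return left + right[1:]
    return left + right


def regenerate(stems):
    """Return every surface compound the stems can form under `join`."""
    surfaces = set()
    for forms in product(*(FORMS[s] for s in stems)):
        word = forms[0]
        for nxt in forms[1:]:
            word = join(word, nxt)
        surfaces.add(word)
    return surfaces


def is_correct(word, analysis):
    """Accept `word` only if regenerating its stems can reproduce it exactly."""
    return word in regenerate(stems_of(analysis))


print(is_correct("poyathukond", "po+athu<noun>+kond<noun>"))
print(is_correct("poyaathukond", "po+athu<noun>+kond<noun>"))
print(is_correct("thukond", "athu<noun>+kond<noun>"))
```

With these assumptions, poyathukond is accepted, while poyaathukond and thukond are rejected because no regenerated form matches them. The sketch places no constraint on which form of a stem may combine with the next one, so a real implementation would need a proper set of sandhi rules and combination constraints.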
