User:Jaseem/spellcheck
Latest revision as of 23:07, 11 March 2014
Malayalam Spell-checker
Problem
English dictionaries "rely on complete lists of full word forms, a requirement that cannot be met for morphologically complex languages" like Malayalam. In theory, an unlimited number of words can be agglutinated into one in Malayalam; in practice it is generally fewer than 10. Handling agglutination and inflection in a spell-checker is therefore challenging.
See http://thottingal.in/documents/MalayalamComputingChallenges.pdf
Other Challenges
- Homophonic root words can have different inflections
- മറക്കുക & മറയുക; പറയുക & പറക്കുക
- The same word can inflect differently in the same context (not common)
- പോവുക, പോകുക
- Sandhi rules are complex.
Possible solutions
Hunspell
Hunspell has an algorithm for handling agglutination. How to use it for Malayalam still needs to be worked out.
Implementation in other languages
Spell Checking an Agglutinative Language: Quechua http://www.zora.uzh.ch/52921/1/ltc-106-rios.pdf
Quechua does not seem to have the complexity of Malayalam sandhi, and the automaton presented in the paper does not seem to work for Malayalam.
- kachichasqa = kachi + cha + sqa
http://www.cmpe.boun.edu.tr/~akin/papers/spelling_checking_in_Turkish.pdf
http://arxiv.org/pdf/cmp-lg/9410004.pdf
- Stemmer
- For finding root words
http://www.ldcil.org/up/conferences/morph/presentations/Vijay%20[Compatibility%20Mode].pdf
http://www.cse.iitb.ac.in/~pb/papers/cicling12-stemming.pdf
Lttoolbox
Lttoolbox, from the Apertium package, can be used to tokenize and lemmatize compounds, agglutinations, and inflections.
<source lang="xml"><dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="noun" />
    <sdef n="s"/>
    <sdef n="pl"/>
    <sdef n="root"/>
    <sdef n="past"/>
    <sdef n="verb"/>
    <sdef n="compound-only-L" c="May only be the left-side of a compound"/>
    <sdef n="compound-R" c="May be the right-side of a compound, or a full word"/>
  </sdefs>
  <pardefs>
    <pardef n="poyi_v">
      <e><p> <l/> <r><s n="verb"/> </r> </p></e>
      <e><p> <l>yi</l> <r><s n="verb"/><s n="past"/></r></p></e>
      <e><p> <l>ya</l> <r><s n="compound-only-L"/></r></p></e>
    </pardef>
    <pardef n="athu_n">
      <e><p> <l>athu</l> <r>athu<s n="noun"/></r> </p></e>
      <e><p> <l>athu</l> <r>athu<s n="noun"/><s n="compound-only-L"/></r></p></e>
      <e><p> <l>thu</l> <r>athu<s n="noun"/><s n="compound-R"/><s n="compound-only-L"/></r></p></e>
    </pardef>
    <pardef n="kond">
      <e><p><l/><r><s n="noun"/><s n="compound-R"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="povuka">
      <i>po</i>
      <par n="poyi_v"/>
    </e>
    <e lm="athu">
      <i></i>
      <par n="athu_n"/>
    </e>
    <e lm="kond">
      <i>kond</i>
      <par n="kond"/>
    </e>
  </section>
</dictionary></source>
The code above returned this result:
Input:
poyathukond
poyaathukond
poyathu
athukond
thu
thukond
Output:
^poyathukond/po+athu<noun>+kond<noun>$
^poyaathukond/po+athu<noun>+kond<noun>$
^poyathu/po+athu<noun>/po+athu<noun>$
^athukond/athu<noun>+kond<noun>$
^thu/athu<noun>$
^thukond/athu<noun>+kond<noun>$
As can be seen, the analyser resolves even wrongly spelled compounds. The current lttoolbox markup does not give finer control over this. To check the spelling, the compound needs to be regenerated from the stems and matched against the input.
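The regenerate-and-match idea can be sketched roughly as follows. This is only a minimal illustration in Python, not part of lttoolbox: the table NON_FINAL_FORM and the single "a + a becomes a" joining rule are assumptions invented to fit the six sample words above, and real Malayalam sandhi would need a much larger rule set. The sketch reads one analysis line in the format shown in the output above, strips the tags, rebuilds the surface form from the stems, and accepts the word only if some analysis regenerates exactly the input.

```python
import re

# ASSUMPTION (toy data): surface form a stem takes when it is NOT the
# last element of a compound. Only "po" changes in this example.
NON_FINAL_FORM = {"po": "poya"}

TAG = re.compile(r"<[^>]*>")  # matches tags such as <noun>


def parse_analysis_line(line):
    """Split '^surface/analysis1/analysis2$' into (surface, list of stem lists)."""
    body = line.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not an analysis line: {line!r}")
    surface, *analyses = body[1:-1].split("/")
    return surface, [TAG.sub("", a).split("+") for a in analyses]


def regenerate(stems):
    """Rebuild a surface form from stems using the toy joining rules."""
    out = ""
    for i, stem in enumerate(stems):
        is_last = i == len(stems) - 1
        piece = stem if is_last else NON_FINAL_FORM.get(stem, stem)
        # ASSUMPTION (toy sandhi rule): a + a collapses to a at a boundary.
        if out.endswith("a") and piece.startswith("a"):
            piece = piece[1:]
        out += piece
    return out


def is_correctly_spelled(line):
    """Accept the word only if some analysis regenerates the input exactly."""
    surface, analyses = parse_analysis_line(line)
    return any(regenerate(stems) == surface for stems in analyses)


if __name__ == "__main__":
    for line in [
        "^poyathukond/po+athu<noun>+kond<noun>$",
        "^poyaathukond/po+athu<noun>+kond<noun>$",
    ]:
        print(line, is_correctly_spelled(line))
```

Under these toy rules, poyathukond, poyathu and athukond regenerate to themselves and are accepted, while poyaathukond, thu and thukond are rejected because their stems regenerate to a different string.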