SMC Wiki - User contributions [en]

User:Jaseem

2014-03-12T16:57:42Z

Jaseem:

==Personal Information==
*'''Email Address:''' jaseemumar@gmail.com
*'''Blog URL:''' http://jaseems.blogspot.com
*'''Freenode IRC nick:''' jaseem
*'''Current Education:''' 2nd Year BTech in Computer Science at Indian Institute of Technology, Bombay
*'''Why do you want to work with the Swathanthra Malayalam Computing?'''
Being a Malayali, I can relate to the cause of developing Malayalam computing aids and am excited about it. I would be able to use what I have learned to directly help people who speak my own language.
*'''Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor?'''
No
*'''Did you participate in past GSoC programs? If so, which years and which organizations?'''
No
*'''Do you have other obligations between May and August ?'''
My college holidays run from May to mid-July, and I have no obligations during that time. I have to attend college during the last two weeks of July and in August; I plan to make up for this by starting to code a bit earlier, during the community bonding period.
*'''Will you continue contributing to/supporting Swathanthra Malayalam Computing after the GSoC 2014 program? If yes, which area(s) are you interested in?'''
Yes, I am glad I found the organisation through GSoC and I am planning to actively contribute outside of the program.
*'''Why should we choose you over other applicants?'''
I have over 6 years of programming experience and am good at Python, the language of the existing spell-checker. I have direct access to the language resources (books and people) required for the project.

==Proposal Description==
Please describe your proposal in detail.

===Overview===

===Implementation===
===Timeline===

User:Jaseem/spellcheck

2014-03-11T23:07:44Z

Jaseem: /* Lttoolbox */

= Malayalam Spell-checker =
== Problem==
English dictionaries "rely on complete lists of full word forms, a requirement that
cannot be met for morphologically complex languages" like Malayalam.
In theory, Malayalam can agglutinate an unlimited number of words into one; in practice it is generally fewer than 10. Handling agglutination and inflection in a spell-checker can be challenging.

Refer to http://thottingal.in/documents/MalayalamComputingChallenges.pdf
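
A rough way to see why a full word-form list does not scale is to count the forms it would need. The sketch below is only an upper bound under a simplifying assumption (any stem may join any other, ignoring grammar and sandhi); the function name and the figure of 1000 stems are hypothetical and chosen for illustration.

```python
def count_full_forms(num_stems: int, max_parts: int) -> int:
    """Upper bound on distinct surface forms when up to `max_parts` stems
    may be joined, assuming every stem can combine with every other."""
    return sum(num_stems ** parts for parts in range(1, max_parts + 1))


# Hypothetical lexicon of 1000 stems: the list grows by a factor of
# roughly 1000 for every extra stem allowed in a compound.
for parts in (1, 2, 3):
    print(parts, count_full_forms(1000, parts))
# prints:
# 1 1000
# 2 1001000
# 3 1001001000
```

Real compounds are far more constrained than this bound, but the growth rate is what makes a rule-based analyser more attractive than an exhaustive list.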

=== Other Challenges ===
*Homophonic root words can have different inflections
*;മറക്കുക & മറയുക; പറയുക & പറക്കുക
*The same word can inflect differently in the same context (not common)
*; പോവുക, പോകുക
*Sandhi rules are complex.

==Possible solutions==
===Hunspell===
Hunspell has an algorithm for handling agglutination. How to use it for Malayalam still needs to be figured out.

===Implementation in other languages===
Spell Checking an Agglutinative Language: Quechua
http://www.zora.uzh.ch/52921/1/ltc-106-rios.pdf
Quechua doesn't seem to have the complexity that Malayalam sandhi has. The automaton presented in the paper doesn't seem to work on Malayalam.
*;kachichasqa= kachi + cha +sqa
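
The difference can be illustrated with a plain concatenative segmenter. This is not the automaton from the paper, only a minimal sketch: it assumes morphemes join without changing shape, and the morpheme lists are toy, romanized examples (the Malayalam one reuses the poyathukond example from the Lttoolbox section).

```python
def segment(word, morphemes):
    """Return every way to split `word` into a sequence of known morphemes,
    assuming morphemes are simply concatenated with no sandhi changes."""
    if not word:
        return [[]]
    results = []
    for morpheme in morphemes:
        if morpheme and word.startswith(morpheme):
            for rest in segment(word[len(morpheme):], morphemes):
                results.append([morpheme] + rest)
    return results


# Toy morpheme lists (assumptions for illustration only).
QUECHUA = ["kachi", "cha", "sqa"]
MALAYALAM_ROMANIZED = ["poya", "athu", "kond"]

print(segment("kachichasqa", QUECHUA))              # [['kachi', 'cha', 'sqa']]
print(segment("poyathukond", MALAYALAM_ROMANIZED))  # []
```

The Quechua word splits cleanly, while the Malayalam compound does not, because "athu" loses its initial vowel at the junction. A purely concatenative model would accept only the unmerged string poyaathukond.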

http://www.cmpe.boun.edu.tr/~akin/papers/spelling_checking_in_Turkish.pdf

http://arxiv.org/pdf/cmp-lg/9410004.pdf

;Stemmer: For finding root words
http://www.ldcil.org/up/conferences/morph/presentations/Vijay%20[Compatibility%20Mode].pdf
http://www.cse.iitb.ac.in/~pb/papers/cicling12-stemming.pdf

===Lttoolbox===
Lttoolbox, from the Apertium package, can be used to tokenize and lemmatize compounds, agglutinations and inflections.

<source lang="xml"><dictionary>
<alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
<sdefs>
<sdef n="noun" />
<sdef n="s"/>
<sdef n="pl"/>
<sdef n="root"/>
<sdef n="past"/>
<sdef n="verb"/>
<sdef n="compound-only-L" c="May only be the left-side of a compound"/>
<sdef n="compound-R" c="May be the right-side of a compound, or a full word"/>
</sdefs>

<pardefs>

<pardef n="poyi_v">
<e> <l/> <r><s n="verb"/> </r> </e>
<e> <l>yi</l> <r><s n="verb"/><s n="past"/></r></e>
<e> <l>ya</l> <r><s n="compound-only-L"/></r></e>
</pardef>

<pardef n="athu_n">
<e> <l>athu</l> <r>athu<s n="noun"/></r> </e>
<e> <l>athu</l> <r>athu<s n="noun"/><s n="compound-only-L"/></r></e>
<e> <l>thu</l> <r>athu<s n="noun"/><s n="compound-R"/><s n="compound-only-L"/></r></e>
</pardef>

<pardef n="kond">
<e><l/><r><s n="noun"/><s n="compound-R"/></r></e>
</pardef>

</pardefs>

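<section id="main" type="standard">
<!-- Each entry: the literal stem in <i>, then the paradigm that supplies its endings -->
<e lm="povuka">
<i>po</i>
<par n="poyi_v"/>
</e>
<!-- athu has no fixed stem part; every form comes from the paradigm -->
<e lm="athu">
<par n="athu_n"/>
</e>
<e lm="kond">
<i>kond</i>
<par n="kond"/>
</e>
</section>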

</dictionary></source>
Running the dictionary above gave this result:

<code>
'''Input:''' 
poyathukond 
poyaathukond 
poyathu 
athukond 
thu 
thukond 

'''Output:''' 
^poyathukond/po+athu<noun>+kond<noun>$ 
^poyaathukond/po+athu<noun>+kond<noun>$ 
^poyathu/po+athu<noun>/po+athu<noun>$ 
^athukond/athu<noun>+kond<noun>$ 
^thu/athu<noun>$ 
^thukond/athu<noun>+kond<noun>$ 
</code>

As the output shows, the analyser also resolves misspelled compounds such as poyaathukond. The current lttoolbox markup does not offer finer control over this. To check the spelling, the compound needs to be regenerated from the stems and compared with the input.
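
The regenerate-and-compare idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the <code>FORMS</code> table is hypothetical and hand-written to mirror the toy dictionary above, the function names are made up, and inflection tags such as <code>past</code> are ignored, so only compound joins are checked. A real implementation would need proper generation rules for sandhi and inflection.

```python
import re

# Hypothetical generation table mirroring the toy dictionary above:
# the surface form of each stem when it stands alone, when it comes
# before another stem, and when it comes after another stem.
FORMS = {
    "po":   {"alone": "po",   "before": "poya"},
    "athu": {"alone": "athu", "before": "athu", "after": "thu"},
    "kond": {"alone": "kond", "after": "kond"},
}


def parse_analysis_line(line):
    """Split '^surface/analysis1/analysis2$' into (surface, list of stem lists)."""
    text = line.strip()
    if not (text.startswith("^") and text.endswith("$")):
        raise ValueError(f"not an analysis line: {line!r}")
    surface, *analyses = text[1:-1].split("/")
    stem_lists = []
    for analysis in analyses:
        if analysis.startswith("*"):  # unknown words are marked with '*'
            continue
        # Drop tags like <noun> and split the compound at '+'.
        stems = [re.sub(r"<[^>]*>", "", part) for part in analysis.split("+")]
        if stems not in stem_lists:
            stem_lists.append(stems)
    return surface, stem_lists


def regenerate(stems):
    """Rebuild the surface form from stems, or return None if the table has no rule."""
    pieces = []
    for index, lemma in enumerate(stems):
        if index > 0:
            position = "after"
        elif len(stems) > 1:
            position = "before"
        else:
            position = "alone"
        form = FORMS.get(lemma, {}).get(position)
        if form is None:
            return None
        pieces.append(form)
    return "".join(pieces)


def is_spelled_correctly(line):
    """True if any analysis regenerates exactly the input surface form."""
    surface, stem_lists = parse_analysis_line(line)
    return any(regenerate(stems) == surface for stems in stem_lists)


print(is_spelled_correctly("^poyathukond/po+athu<noun>+kond<noun>$"))   # True
print(is_spelled_correctly("^poyaathukond/po+athu<noun>+kond<noun>$"))  # False
```

With this table, poyathukond, poyathu and athukond from the output above are accepted, while poyaathukond is rejected because the regenerated form differs from the input. Whether forms like thu and thukond should be rejected (as this table does) depends on the actual sandhi rules.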

User:Jaseem/spellcheck

2014-03-11T23:06:48Z

Jaseem:


User:Jaseem/spellcheck

2014-03-06T14:21:10Z

Jaseem:

User:Jaseem/spellcheck

2014-03-06T14:13:24Z

Jaseem:

User:Jaseem/spellcheck

2014-03-03T14:15:39Z

Jaseem:

User:Jaseem/spellcheck

2014-03-03T11:31:12Z

Jaseem: Created page with "= Malayalam Spell-checker = == Problem== English dictionaries "rely on complete lists of full word forms, a requirement that cannot be met for morphologically complex language..."