<p>User:സന്തോഷ്/HunspellConversation, page created 2014-03-08 by സന്തോഷ് (revision history, wiki.smc.org.in)</p>
<p><b>New page</b></p><div><pre><br />
From: <santhosh.thottingal@gmail.com><br />
Date: 2008-12-20 17:49 GMT+05:30<br />
To: nemeth.lacko@gmail.com<br />
<br />
<br />
Hi,<br />
Thanks for your help with hyphenation. Now I need your help in preparing the Hunspell dictionaries for Indian languages.<br />
Let me explain the nature of agglutination present in Indian languages.<br />
Malayalam (ml_IN) is a highly inflectional and agglutinative language.<br />
I will explain one example:<br />
മഴക്കാലമേഘങ്ങളെല്ലാമിരുണ്ടുകൂടി<br />
This is a compound word. It can be split as follows:<br />
മഴ (rain, noun) + കാല (season, inflected form of കാലം) + മേഘ (cloud, inflected form of മേഘം) + ങ്ങള് (plural suffix for the previous word) + എല്ലാം (meaning: all) +<br />
ഇരുണ്ടു (darken, verb) + കൂടി (suffix that makes the previous word mean "darkened")<br />
<br />
Another example:<br />
പൂച്ചക്കുട്ടിയുടെ (meaning: of the kitten)<br />
Split: പൂച്ച (cat) + കുട്ടി (child) + ഉടെ (inflection meaning 'of')<br />
<br />
<br />
I was going through the documentation present here: https://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754<br />
I wrote a simple one-level suffix .aff file. But how can we handle agglutination that may sometimes chain more than 4 or 5 words, each of which may be in an inflected form?<br />
I look forward to your input on how to approach this problem.<br />
<br />
Thanks<br />
Santhosh<br />
<br />
<br />
----------<br />
From: Németh László <nemeth@openoffice.org><br />
Date: 2008-12-22 22:09 GMT+05:30<br />
To: santhosh.thottingal@gmail.com<br />
Cc: hunspell-devel@lists.sourceforge.net<br />
<br />
<br />
Hi,<br />
<br />
There are several methods for handling complex compounding and<br />
affixation with Hunspell. You need basic compound flag(s) (COMPOUNDFLAG or<br />
COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND) on dictionary items or _on<br />
suffixes_ to permit compounding. You can use COMPOUNDPERMITFLAG and<br />
COMPOUNDFORBIDFLAG on affixes to permit or to forbid affixed word<br />
forms within compounds. By default, a compound word may take one<br />
prefix at the beginning and one suffix at the end of the word:<br />
(arbitrary_prefix)?(compound)+(arbitrary_suffix).<br />
ONLYINCOMPOUND flags or word forms also help to reduce the complexity<br />
of affixation and compounding (for example, some German dictionaries<br />
contain redundant affixed forms in the dic file with ONLYINCOMPOUND<br />
flags to handle Fuge elements, i.e. affixation within compounds<br />
combined with decapitalization of German nouns).<br />
<br />
For complex agglutination, see my answer in this bug report:<br />
https://sourceforge.net/tracker/index.php?func=detail&aid=2413299&group_id=143754&atid=756395<br />
<br />
Use these methods only if you really need them to generate (accept)<br />
millions of compound words. Example (also attached):<br />
--- affix file ---<br />
SET UTF-8<br />
COMPOUNDMIN 1<br />
COMPOUNDFLAG c<br />
ONLYINCOMPOUND x<br />
COMPOUNDPERMITFLAG p<br />
COMPOUNDEND e<br />
<br />
SFX a Y 1<br />
SFX a ം ങ്ങള്/cp ം<br />
<br />
SFX b Y 1<br />
SFX b 0 കൂടി/e .<br />
<br />
----- dic file -----<br />
5<br />
മഴ/c<br />
കാല/cx<br />
മേഘം/a<br />
എല്ലാം/c<br />
ഇരുണ്ടു/b<br />
-----<br />
<br />
echo "മഴകാലമേഘങ്ങള്എല്ലാംഇരുണ്ടുകൂടി" | hunspell -d ml_IN -w<br />
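To see why this example is expected to accept the test word, it can help to model the two files in a few lines of Python. The sketch below is only a toy model under stated assumptions, not Hunspell's actual algorithm: flag c marks a part allowed anywhere in a compound, flag e marks a part allowed only at the end, and the x (ONLYINCOMPOUND) and p (COMPOUNDPERMITFLAG) flags are carried along but not enforced. The names expand and split_compound are hypothetical helpers, not part of any Hunspell API.

```python
# Toy model of the example .aff/.dic pair above; NOT Hunspell's real algorithm.
# flag -> (strip, add, flags carried by the suffix, required word ending)
SUFFIXES = {
    "a": ("ം", "ങ്ങള്", "cp", "ം"),
    "b": ("", "കൂടി", "e", ""),
}

# word -> flags, copied from the dic file
DICTIONARY = {
    "മഴ": "c",
    "കാല": "cx",
    "മേഘം": "a",
    "എല്ലാം": "c",
    "ഇരുണ്ടു": "b",
}


def expand(dictionary, suffixes):
    """Return {surface form: flags}, adding one suffixed form per matching rule."""
    forms = dict(dictionary)
    for stem, flags in dictionary.items():
        for flag in flags:
            if flag not in suffixes:
                continue  # c, x, e, p are not suffix classes in this model
            strip, add, suffix_flags, condition = suffixes[flag]
            if stem.endswith(condition):
                base = stem[: len(stem) - len(strip)]
                forms[base + add] = suffix_flags
    return forms


def split_compound(word, forms):
    """Split word into parts flagged 'c' (any position) or 'e' (last part only).

    Returns the list of parts, or None if no such split exists.
    """
    if not word:
        return []
    for end in range(1, len(word) + 1):
        head, rest = word[:end], word[end:]
        flags = forms.get(head, "")
        if "c" in flags or ("e" in flags and not rest):
            tail = split_compound(rest, forms)
            if tail is not None:
                return [head] + tail
    return None


forms = expand(DICTIONARY, SUFFIXES)
test_word = "മഴ" + "കാല" + "മേഘ" + "ങ്ങള്" + "എല്ലാം" + "ഇരുണ്ടു" + "കൂടി"
print(split_compound(test_word, forms))
```

In this model the bare stems മേഘം and ഇരുണ്ടു are not accepted as compound parts on their own, because only their suffixed forms carry a compound flag.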
<br />
Unfortunately, this currently works only with Hunspell 1.1.12! It seems<br />
that COMPOUNDMIN with UTF-8 doesn't work well in Hunspell 1.2.8. I will fix it, and<br />
you can use the extended CHECKCOMPOUNDPATTERN of Hunspell 1.2.8 for<br />
special UTF-8 character transitions at compound boundaries, if you<br />
need it (from your example it seems to me that you do, but I am not sure).<br />
<br />
Regards,<br />
László<br />
<br />
<br />
2008/12/20 <santhosh.thottingal@gmail.com>:<br />
<br />
<br />
----------<br />
From: <santhosh.thottingal@gmail.com><br />
Date: 2008-12-22 22:23 GMT+05:30<br />
To: Németh László <nemeth@openoffice.org><br />
<br />
<br />
Hi, thanks for the detailed reply. I will study this during the Christmas holidays and get back to you if I have any doubts.<br />
Happy Xmas and New year!!!<br />
-Santhosh<br />
<br />
<br />
----------<br />
From: <santhosh.thottingal@gmail.com><br />
Date: 2008-12-25 17:50 GMT+05:30<br />
To: Németh László <nemeth@openoffice.org><br />
<br />
<br />
Hi,<br />
I am getting interesting results with Hunspell. Attached are the .dic, .aff, test, and result files.<br />
I am wondering what this result means:<br />
<br />
"മരത്തിലെ" is incorrect!<br />
suggestions:<br />
..."മരത്തിലെ"<br />
<br />
The word and the suggestion are the same!<br />
I am using the hunspell/tools/example code for testing.<br />
<br />
Thanks<br />
Santhosh<br />
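When a result file is long, cases like the one above can be picked out automatically. The following is a minimal sketch that assumes the report format is exactly what is quoted in this message (a line reading "word" is incorrect!, then a suggestions: line, then one ..."suggestion" line per suggestion); the function name find_self_suggestions is hypothetical.

```python
import re


def find_self_suggestions(report):
    """Return words reported as incorrect whose suggestions include the word itself."""
    flagged = []
    current = None  # the most recent word reported as incorrect, if any
    for raw_line in report.splitlines():
        line = raw_line.strip()
        header = re.fullmatch(r'"(.+)" is (.+)', line)
        if header:
            # Only track words the tool marked as incorrect.
            current = header.group(1) if header.group(2) == "incorrect!" else None
            continue
        suggestion = re.fullmatch(r'\.\.\."(.+)"', line)
        if suggestion and current is not None and suggestion.group(1) == current:
            if current not in flagged:
                flagged.append(current)
    return flagged


sample = '"മരത്തിലെ" is incorrect!\nsuggestions:\n..."മരത്തിലെ"\n'
print(find_self_suggestions(sample))
```

Any word returned by this function was rejected by the checker and then offered back unchanged as a suggestion, which is the inconsistency described above.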
<br />
On Monday 22 December 2008 22:09:50 you wrote:<br />
<br />
<br />
----------<br />
From: Németh László <nemeth@openoffice.org><br />
Date: 2008-12-30 20:12 GMT+05:30<br />
To: santhosh.thottingal@gmail.com<br />
Cc: hunspell-devel@lists.sourceforge.net<br />
<br />
<br />
Hi,<br />
<br />
Good news: my example works with Hunspell 1.2.8, too. But instead of<br />
the "hunspell" executable, use the "example" application for testing.<br />
(Tokenization by the "hunspell" executable has some problems.)<br />
<br />
Happy New Year!<br />
<br />
László<br />
<br />
2008/12/22 <santhosh.thottingal@gmail.com>:<br />
<br />
<br />
----------<br />
From: Németh László <nemeth@openoffice.org><br />
Date: 2009-01-05 16:37 GMT+05:30<br />
To: santhosh.thottingal@gmail.com<br />
<br />
<br />
Hi,<br />
<br />
2008/12/25 <santhosh.thottingal@gmail.com>:<br />
> Hi,<br />
><br />
> I am getting interesting results with Hunspell. Attached are the .dic, .aff,<br />
> test, and result files.<br />
><br />
> I am wondering what this result means:<br />
><br />
> "മരത്തിലെ" is incorrect!<br />
><br />
> suggestions:<br />
><br />
> ..."മരത്തിലെ"<br />
><br />
> The word and the suggestion are the same!<br />
<br />
I am working on the problem. Thanks for the report!<br />
<br />
><br />
> I am using the hunspell/tools/example code for testing<br />
<br />
I am glad to see you have also found the working tool. :)<br />
<br />
Regards,<br />
László<br />
<br />
<br />
----------<br />
From: Santhosh Thottingal <santhosh.thottingal@gmail.com><br />
Date: 2009-01-18 10:30 GMT+05:30<br />
To: Németh László <nemeth@openoffice.org><br />
<br />
<br />
2009/1/5 Németh László <nemeth@openoffice.org>:<br />
> Hi,<br />
><br />
> 2008/12/25 <santhosh.thottingal@gmail.com>:<br />
>> Hi,<br />
>><br />
>> I am getting interesting results with Hunspell. Attached are the .dic, .aff,<br />
>> test, and result files.<br />
>><br />
>> I am wondering what this result means:<br />
> I am working on the problem. Thanks for the report!<br />
<br />
Hi,<br />
Any updates on this problem?<br />
<br />
Thanks<br />
Santhosh<br />
<br />
End of conversation<br />
<br />
------------------------------------------------------------------------------------<br />
<br />
me: nemeth, do u have few minutes? I have some questions on hunspell<br />
1:23 PM Németh: Hi,<br />
1:24 PM me: I was discussing the fedora /ubuntu move towards hunspell as a single spell checker in distro..<br />
they were raising an issue related to "soundslike "<br />
do we have that feature in hunspell?<br />
1:25 PM for Hindi (hi_IN) we have those rules in place for aspell<br />
Németh: Yes, see PHONE in the manual.<br />
You can use Aspell phone table with PHONE<br />
me: let me take the manual...<br />
1:26 PM Németh: But Unicode encoding is not supported perfectly yet with PHONE.<br />
me: oh, we want unicode<br />
Németh: http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754<br />
1:27 PM Simple transformation rules of the Aspell phone table will work with Unicode, but character matching patterns (asterisk, []) will not<br />
1:28 PM me: do u have a sample for any language in unicode?<br />
1:29 PM Németh: See the en_US.zip package on hunspell.sf.net for a PHONE example.<br />
1:30 PM http://downloads.sourceforge.net/hunspell/en_US.zip?modtime=1188377055&big_mirror=0<br />
1:31 PM me: ok, thanks, I will check that...I will get back to you after that.. if I have any questions...<br />
1:33 PM Németh: Ok,<br />
you are welcome.<br />
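The "soundslike" idea discussed above can be pictured as mapping every word to a phonetic key and treating words with the same key as candidates for each other. The sketch below is a toy illustration only: it uses plain substring rules applied left to right, which is far simpler than the real Aspell phone table semantics, and the rule list TOY_RULES and both function names are made up for the example.

```python
# Hypothetical plain-substring rules; a real phone table has richer syntax.
TOY_RULES = [("PH", "F"), ("CK", "K"), ("C", "K")]


def phonetic_key(word, rules):
    """Build a key by scanning left to right and applying the first matching rule."""
    text = word.upper()
    key = []
    i = 0
    while i < len(text):
        for pattern, replacement in rules:
            if text.startswith(pattern, i):
                key.append(replacement)
                i += len(pattern)
                break
        else:
            # No rule matched at this position: keep the character as it is.
            key.append(text[i])
            i += 1
    return "".join(key)


def sounds_like(word, dictionary, rules):
    """Return dictionary words that share the phonetic key of word."""
    target = phonetic_key(word, rules)
    return [entry for entry in dictionary if phonetic_key(entry, rules) == target]


print(sounds_like("fone", ["phone", "bone", "cone"], TOY_RULES))
```

Because Python strings are sequences of Unicode code points, plain substring rules like these apply to any script; wildcard or character-class patterns would need extra matching logic that this sketch does not attempt.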
<br />
<br />
<br />
me: Hi, got few minutes? got a doubt in Hunspell<br />
Németh: Hi, a few minutes...<br />
9:09 PM me: I was trying REP and PHONE in .aff file<br />
I did not understand the difference between the two. The PHONE rules did not work for my Hindi rules, but REP is working<br />
9:10 PM How do PHONE and REP differ?<br />
both are used for replacing one string with another, right?<br />
9:13 PM Németh: Unfortunately, PHONE works only with ASCII characters so far!<br />
me: oh, my content was hindi<br />
Németh: (Maybe with 8 bit characters, too.)<br />
me: Unicode<br />
9:14 PM Németh: It was my fault, sorry.<br />
me: no problem :)<br />
Németh: In fact, I have just recognised this problem, too.<br />
Thanks for your question. :)<br />
me: I am getting required replacement if I use REP<br />
9:15 PM Actually I only want replacement. So how do REP and PHONE differ?<br />
9:17 PM I was trying to port Aspell phonetic rules to Hunspell<br />
9:18 PM Németh: REP makes a single replacement and checks the result. PHONE is the Aspell phonetic algorithm (but it has encoding problems in Hunspell).<br />
9:20 PM Aspell uses an 8-bit mapping for Unicode, so its 8-bit PHONE algorithm works with Hindi.<br />
9:21 PM But I have to rewrite it for Unicode (or use some third-party regex engine).<br />
9:22 PM I have to go, but I will be reachable by e-mail. Bye.<br />
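The REP behaviour described in this chat (make a single replacement, then check the result) can be sketched as follows. This is a simplified illustration rather than Hunspell's implementation; the function rep_suggestions, the rule list, and the word list are all hypothetical, and English strings are used only to keep the example easy to read.

```python
def rep_suggestions(word, rep_rules, dictionary):
    """Suggest words reachable from word by applying exactly one replacement rule once."""
    suggestions = []
    for wrong, right in rep_rules:
        start = word.find(wrong)
        while start != -1:
            # Replace only this one occurrence, then check the result.
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in dictionary and candidate not in suggestions:
                suggestions.append(candidate)
            start = word.find(wrong, start + 1)
    return suggestions


# Hypothetical rules in the spirit of "REP f ph" and "REP shun tion".
RULES = [("f", "ph"), ("shun", "tion")]
WORDS = {"phone", "nation", "phonation"}

print(rep_suggestions("fone", RULES, WORDS))
print(rep_suggestions("fonashun", RULES, WORDS))
```

The second call needs two replacements to reach a dictionary word, so a single-replacement check finds nothing for it. A phonetic key approach, by contrast, compares whole words after transforming all of their sounds.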
<br />
</pre></div>സന്തോഷ്