From SMC Wiki
From: <email@example.com>
Date: 2008-12-20 17:49 GMT+05:30
To: firstname.lastname@example.org

Hi,

Thanks for your help with hyphenation. Now I need your help in preparing the Hunspell dictionaries for Indian languages. Let me explain the nature of agglutination present in Indian languages.

Malayalam (ml_IN) is a highly inflectional and agglutinative language. I will explain with one example:

മഴക്കാലമേഘങ്ങളെല്ലാമിരുണ്ടുകൂടി

This is a compound word. It can be split like this:

മഴ (rain, noun)
+ കാല (season, inflected form of കാലം)
+ മേഘ (cloud, inflected form of മേഘം)
+ ങ്ങള് (plural form suffix for the previous word)
+ എല്ലാം (meaning: all)
+ ഇരുണ്ടു (darken, verb)
+ കൂടി (suffix for the previous word, to mean "darkened")

Another example:

പൂച്ചക്കുട്ടിയുടെ (meaning: of kitten)
Split: പൂച്ച (cat) + കുട്ടി (child) + ഉടെ (inflection to show "of")

I was going through the documentation here:
https://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754

I wrote a simple one-level suffix .aff file. But how can we handle agglutination that may sometimes run to more than 4 or 5 words, where each word may be in an inflected form?

Expecting your inputs on approaching this problem.

Thanks
Santhosh

----------
From: Németh László <email@example.com>
Date: 2008-12-22 22:09 GMT+05:30
To: firstname.lastname@example.org
Cc: email@example.com

Hi,

There are several methods for using Hunspell to handle complex compounding and affixation. You need basic compound flag(s) (COMPOUNDFLAG or COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND) on dictionary items or _on suffixes_ to permit compounding. You can use COMPOUNDPERMITFLAG and COMPOUNDFORBIDFLAG on affixes to permit or to forbid affixed word forms within compounds. The default for compound words is one prefix at the beginning and one suffix at the end of the word: (arbitrary_prefix)?(compound)+(arbitrary_suffix).
ONLYINCOMPOUND flags or word forms also help to reduce the complexity of affixation and compounding. For example, some German dictionaries contain redundant affixed forms in the dic file with ONLYINCOMPOUND flags to handle Fuge-elements, i.e. affixation within compounds combined with decapitalization of German nouns.

For complex agglutination, see my answer in this bug report:
https://sourceforge.net/tracker/index.php?func=detail&aid=2413299&group_id=143754&atid=756395

Use these methods only if you really need to generate (accept) millions of compound words.

Example (also attached):

--- affix file ---
SET UTF-8
COMPOUNDMIN 1
COMPOUNDFLAG c
ONLYINCOMPOUND x
COMPOUNDPERMITFLAG p
COMPOUNDEND e
SFX a Y 1
SFX a ം ങ്ങള്/cp ം
SFX b Y 1
SFX b 0 കൂടി/e .
----- dic file -----
5
മഴ/c
കാല/cx
മേഘം/a
എല്ലാം/c
ഇരുണ്ടു/b
-----
echo "മഴകാലമേഘങ്ങള്എല്ലാംഇരുണ്ടുകൂടി" | hunspell -d ml_IN -w

Unfortunately, this works only with Hunspell 1.1.12 now! It seems UTF-8 COMPOUNDMIN doesn't work well in Hunspell 1.2.8. I will fix it, and you can use the extended CHECKCOMPOUNDPATTERN of Hunspell 1.2.8 for special UTF-8 character transitions at compound boundaries, if you need it (it seems so to me from your example, but I don't know).

Regards,
László

2008/12/20 <firstname.lastname@example.org>:

----------
From: <email@example.com>
Date: 2008-12-22 22:23 GMT+05:30
To: Németh László <firstname.lastname@example.org>

Hi,

Thanks for the detailed reply. I will study this during the Christmas holidays. I will get back to you in case of any doubts.

Happy Xmas and New Year!!!

-Santhosh

----------
From: <email@example.com>
Date: 2008-12-25 17:50 GMT+05:30
To: Németh László <firstname.lastname@example.org>

Hi,

I am getting interesting results with hunspell. Attached the .dic. .add and test, result file.

I am wondering what is the meaning of this result:

"മരത്തിലെ" is incorrect!

suggestions:

..."മരത്തിലെ"

The word and the suggestion are the same!

I am using the hunspell/tools/example code for testing.

Thanks
Santhosh

On Monday 22 December 2008 22:09:50 you wrote:

----------
From: Németh László <email@example.com>
Date: 2008-12-30 20:12 GMT+05:30
To: firstname.lastname@example.org
Cc: email@example.com

Hi,

Good news: my example works with Hunspell 1.2.8, too. But instead of the "hunspell" executable, use the "example" application for testing. (Tokenization by the "hunspell" executable has some problems.)

Happy New Year!
László

2008/12/22 <firstname.lastname@example.org>:

----------
From: Németh László <email@example.com>
Date: 2009-01-05 16:37 GMT+05:30
To: firstname.lastname@example.org

Hi,

2008/12/25 <email@example.com>:
> Hi,
>
> I am getting interesting results with hunspell. Attached the .dic. .add and
> test, result file .
>
> I am wondering what is the meaning of this result
>
> "മരത്തിലെ" is incorrect!
>
> suggestions:
>
> ..."മരത്തിലെ"
>
> word and suggestion are same!

I am working on the problem. Thanks for the report!

>
> I am using the hunspell/tools/example code for testing

I am glad to see you have also found the working tool. :)

Regards,
László

----------
From: Santhosh Thottingal <firstname.lastname@example.org>
Date: 2009-01-18 10:30 GMT+05:30
To: Németh László <email@example.com>

2009/1/5 Németh László <firstname.lastname@example.org>:
> Hi,
>
> 2008/12/25 <email@example.com>:
>> Hi,
>>
>> I am getting interesting results with hunspell. Attached the .dic. .add and
>> test, result file .
>>
>> I am wondering what is the meaning of this result
> I am working on the problem. Thanks for the report!

Hi,

Any updates on this problem?

Thanks
Santhosh

End of conversation
------------------------------------------------------------------------------------

me: nemeth, do u have few minutes? I have some questions on hunspell
1:23 PM
Németh: Hi,
1:24 PM
me: I was discussing the fedora /ubuntu move towards hunspell as a single spell checker in distro..
they were raising an issue related to "soundslike". do we have that feature in hunspell?
1:25 PM
for Hindi (hi_IN) we have those rules in place for aspell
Németh: Yes, see PHONE in the manual. You can use the Aspell phone table with PHONE
me: let me take the manual...
1:26 PM
Németh: But Unicode encoding is not supported perfectly yet with PHONE.
me: oh, we want unicode
Németh: http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754
1:27 PM
Simple transformation rules of the Aspell phone table will work with Unicode, but not character matching patterns (asterisk, )
1:28 PM
me: do u have a sample for any language in unicode?
1:29 PM
Németh: See the en_US.zip package on hunspell.sf.net for a PHONE example.
1:30 PM
http://downloads.sourceforge.net/hunspell/en_US.zip?modtime=1188377055&big_mirror=0
1:31 PM
me: ok, thanks, I will check that...I will get back to you after that.. if I have any questions...
1:33 PM
Németh: Ok, you are welcome.

me: Hi, got few minutes? got a doubt in Hunspell
Németh: Hi, a few minutes...
9:09 PM
me: I was trying REP and PHONE in .aff file. I did not get the difference between both. The PHONE rules did not work for my hindi rules. But REP is working
9:10 PM
How do PHONE and REP differ? both are used for replacing some string with another, right?
9:13 PM
Németh: Unfortunately, PHONE works only with ASCII characters, yet!
me: oh, my content was hindi
Németh: (Maybe with 8 bit characters, too.)
me: Unicode
9:14 PM
Németh: It was my fault, sorry.
me: no problem :)
Németh: In fact, I have just recognised this problem, too. Thanks for your question. :)
me: I am getting the required replacement if I use REP
9:15 PM
Actually I want replacement only. So how do REP and PHONE differ?
9:17 PM
I was trying to port Aspell phonetic rules to Hunspell
9:18 PM
Németh: REP makes a single replacement and checks the result. PHONE is the Aspell phonetic algorithm (but it has encoding problems in Hunspell).
9:20 PM
Aspell uses an 8-bit mapping for Unicode, so its 8-bit PHONE algorithm works with Hindi.
9:21 PM
But I have to rewrite it for Unicode (or use some third-party regex engine).
9:22 PM
I have to go, but I will be reachable by e-mail. Bye.
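
Editorial note: the description of REP in the chat above ("makes a single replacement and check the result") can be illustrated with a small Python sketch. This is only a simplified model of that one sentence, not Hunspell's actual implementation. The function names rep_candidates and rep_suggest, the toy rule table and the toy word list are all hypothetical and are not taken from any real .aff or .dic file.

```python
def rep_candidates(word, rep_table):
    """Yield each string made by applying exactly one rule at one position."""
    seen = set()
    for old, new in rep_table:
        start = word.find(old)
        while start != -1:
            candidate = word[:start] + new + word[start + len(old):]
            if candidate not in seen:
                seen.add(candidate)
                yield candidate
            # Look for the next occurrence of the same pattern.
            start = word.find(old, start + 1)


def rep_suggest(word, rep_table, dictionary):
    """Keep only the single-replacement candidates found in the word list."""
    return [c for c in rep_candidates(word, rep_table) if c in dictionary]


# Hypothetical toy data, for illustration only.
REP_TABLE = [("f", "ph"), ("i", "y")]
DICTIONARY = {"phone", "physics"}

print(rep_suggest("fone", REP_TABLE, DICTIONARY))    # ['phone']
print(rep_suggest("fisics", REP_TABLE, DICTIONARY))  # []
```

In this model "fisics" yields no suggestion because reaching "physics" would need two replacements, and the sketch never chains rules. Python strings are Unicode, so the same sketch works on Hindi or Malayalam text; that says nothing about how Hunspell itself treats encodings.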