User:സന്തോഷ്/HunspellConversation

From SMC Wiki
From: <santhosh.thottingal@gmail.com>
Date: 2008-12-20 17:49 GMT+05:30
To: nemeth.lacko@gmail.com


Hi,
Thanks for your help with hyphenation. Now I need your help in preparing Hunspell dictionaries for Indian languages.
Let me explain the nature of agglutination in Indian languages.
Malayalam (ml_IN) is a highly inflectional and agglutinative language.
Here is one example:
മഴക്കാലമേഘങ്ങളെല്ലാമിരുണ്ടുകൂടി
This is a compound word. It can be split as follows:
മഴ (rain, noun) + കാല (season, inflected form of കാലം) + മേഘ (cloud, inflected form of മേഘം) + ങ്ങള്‍ (plural suffix for the previous word) + എല്ലാം (meaning: all) +
ഇരുണ്ടു (darken, verb) + കൂടി (suffix for the previous word, giving the meaning "darkened")

Another example:
പൂച്ചക്കുട്ടിയുടെ (meaning: of the kitten)
Split: പൂച്ച (cat) + കുട്ടി (child) + ഉടെ (inflection meaning 'of')


I was going through the documentation here: https://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754
I wrote a simple one-level suffix .aff file. But how can we handle agglutination that may sometimes run to more than 4 or 5 words, each of which may be in an inflected form?
I look forward to your input on how to approach this problem.

Thanks
Santhosh


----------
From: Németh László <nemeth@openoffice.org>
Date: 2008-12-22 22:09 GMT+05:30
To: santhosh.thottingal@gmail.com
Cc: hunspell-devel@lists.sourceforge.net


Hi,

There are several methods for using Hunspell to handle complex compounding
and affixation. You need basic compound flag(s) (COMPOUNDFLAG or
COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND) on dictionary items or _on
suffixes_ to permit compounding. You can use COMPOUNDPERMITFLAG and
COMPOUNDFORBIDFLAG on affixes to permit affixed word forms within
compounds or to forbid affixed word forms in compounds. The default for
compound words is one prefix at the beginning and one suffix at
the end of the word: (arbitrary_prefix)?(compound)+(arbitrary_suffix).
ONLYINCOMPOUND flags or word forms also help to reduce the complexity
of affixation and compounding (for example, some German dictionaries
contain redundant affixed forms in the dic file with ONLYINCOMPOUND
flags to handle Fuge elements, i.e. affixation within compounds
combined with decapitalization of German nouns).

For complex agglutination, see my answer in this bug report:
https://sourceforge.net/tracker/index.php?func=detail&aid=2413299&group_id=143754&atid=756395

Use these methods only if you really need to generate (accept)
millions of compound words. Example (also attached):
--- affix file ---
SET UTF-8
COMPOUNDMIN 1
COMPOUNDFLAG c
ONLYINCOMPOUND x
COMPOUNDPERMITFLAG p
COMPOUNDEND e

SFX a Y 1
SFX a ം ങ്ങള്‍/cp ം

SFX b Y 1
SFX b 0 കൂടി/e .

----- dic file -----
5
മഴ/c
കാല/cx
മേഘം/a
എല്ലാം/c
ഇരുണ്ടു/b
-----

echo "മഴകാലമേഘങ്ങള്‍എല്ലാംഇരുണ്ടുകൂടി" | hunspell -d ml_IN -w

Unfortunately, this works only with Hunspell 1.1.12 now! It seems
UTF-8 COMPOUNDMIN doesn't work well in Hunspell 1.2.8. I will fix it, and
you can use the extended CHECKCOMPOUNDPATTERN of Hunspell 1.2.8 for
special UTF-8 character transitions at compound boundaries, if you
need it (it seems to me from your example that you do, but I don't know).

Regards,
László
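
The flag combination in the sample above can be pictured with a small Python toy model. This is only a sketch under loud assumptions: it is not Hunspell's algorithm, the function names (`expand`, `split_compound`, `accepts`) are hypothetical, COMPOUNDMIN and the finer COMPOUNDPERMITFLAG semantics are ignored, and a compound is simply taken to be two or more surface forms where every part carries the COMPOUNDFLAG `c`, except that the last part may carry COMPOUNDEND `e` instead.

```python
# Toy model of the sample .aff/.dic above. It is NOT Hunspell's algorithm.
COMPOUNDFLAG, ONLYINCOMPOUND, COMPOUNDEND = "c", "x", "e"

# Stems and flags copied from the sample .dic file.
DIC = {"മഴ": "c", "കാല": "cx", "മേഘം": "a", "എല്ലാം": "c", "ഇരുണ്ടു": "b"}

# Suffix class -> (strip, add, flags on the suffixed form), from the sample .aff.
SUFFIXES = {"a": ("ം", "ങ്ങള്‍", "cp"), "b": ("", "കൂടി", "e")}


def expand(dic, suffixes):
    """Map every surface form (stem or stem + suffix) to its set of flags."""
    forms = {}
    for stem, flags in dic.items():
        forms[stem] = set(flags)
        for cls in flags:
            if cls in suffixes:
                strip, add, new_flags = suffixes[cls]
                if stem.endswith(strip):
                    base = stem[: len(stem) - len(strip)]
                    forms[base + add] = set(new_flags)
    return forms


def split_compound(word, forms):
    """Return a list of two or more parts spelling `word`, or None."""

    def walk(start, parts):
        if start == len(word):
            return parts if len(parts) >= 2 else None
        for end in range(start + 1, len(word) + 1):
            flags = forms.get(word[start:end])
            if flags is None:
                continue
            is_last = end == len(word)
            # Simplification: any part needs "c"; the last may use "e" instead.
            if COMPOUNDFLAG in flags or (is_last and COMPOUNDEND in flags):
                found = walk(end, parts + [word[start:end]])
                if found:
                    return found
        return None

    return walk(0, [])


def accepts(word, forms):
    """Accept standalone forms (unless only-in-compound) and valid compounds."""
    flags = forms.get(word)
    if flags is not None and ONLYINCOMPOUND not in flags:
        return True
    return split_compound(word, forms) is not None
```

Under these assumptions, `accepts` rejects കാല on its own, because it carries the ONLYINCOMPOUND flag, while still allowing it as a part of a longer compound.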




----------
From: <santhosh.thottingal@gmail.com>
Date: 2008-12-22 22:23 GMT+05:30
To: Németh László <nemeth@openoffice.org>


Hi, thanks for the detailed reply. I will study this during the Christmas holidays. I will get back to you if I have any doubts.
Happy Xmas and New year!!!
-Santhosh


----------
From: <santhosh.thottingal@gmail.com>
Date: 2008-12-25 17:50 GMT+05:30
To: Németh László <nemeth@openoffice.org>


Hi,
I am getting interesting results with Hunspell. I have attached the .dic and .aff files, along with the test and result files.
I am wondering what this result means:

"മരത്തിലെ" is incorrect!
suggestions:
..."മരത്തിലെ"

The word and the suggestion are the same!
I am using the hunspell/tools/example code for testing.

Thanks
Santhosh



----------
From: Németh László <nemeth@openoffice.org>
Date: 2008-12-30 20:12 GMT+05:30
To: santhosh.thottingal@gmail.com
Cc: hunspell-devel@lists.sourceforge.net


Hi,

Good news: my example works with Hunspell 1.2.8, too. But instead of
the "hunspell" executable, use the "example" application for testing.
(Tokenization by the "hunspell" executable has some problems.)

Happy New Year!

László



----------
From: Németh László <nemeth@openoffice.org>
Date: 2009-01-05 16:37 GMT+05:30
To: santhosh.thottingal@gmail.com


Hi,

2008/12/25  <santhosh.thottingal@gmail.com>:
> Hi,
>
> I am getting interesting results with hunspell. Attached the .dic. .add and
> test, result file .
>
> I am wondering what is the meaning of this result
>
> "മരത്തിലെ" is incorrect!
>
> suggestions:
>
> ..."മരത്തിലെ"
>
> word and suggestion are same!

I am working on the problem. Thanks for the report!

>
> I am using the hunspell/tools/example code for testing

I am glad to see you have also found the working tool. :)

Regards,
László


----------
From: Santhosh Thottingal <santhosh.thottingal@gmail.com>
Date: 2009-01-18 10:30 GMT+05:30
To: Németh László <nemeth@openoffice.org>


2009/1/5 Németh László <nemeth@openoffice.org>:
> Hi,
>
> 2008/12/25  <santhosh.thottingal@gmail.com>:
>> Hi,
>>
>> I am getting interesting results with hunspell. Attached the .dic. .add and
>> test, result file .
>>
>> I am wondering what is the meaning of this result
> I am working on the problem. Thanks for the report!

Hi,
Any updates on this problem?

Thanks
Santhosh

End of conversation

------------------------------------------------------------------------------------

me: nemeth, do u have few minutes? I have some questions on hunspell
1:23 PM Németh: Hi,
1:24 PM me: I was discussing the fedora /ubuntu move towards hunspell as a single spell checker in distro..
  they were raising an issue related to "soundslike "
  do we have that feature in hunspell?
1:25 PM for Hindi (hi_IN) we have those rules in place for Aspell
 Németh: Yes, see PHONE in the manual.
  You can use Aspell phone table with PHONE
 me: let me take the manual...
1:26 PM Németh: But Unicode encoding is not supported perfectly yet with PHONE.
 me: oh, we want unicode
 Németh: http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754
1:27 PM Simple transformation rules of the Aspell phone table will work with Unicode, but not character matching patterns (asterisk, [])
1:28 PM me: do u have a sample for any language in unicode?
1:29 PM Németh: See the en_US.zip package on hunspell.sf.net for a PHONE example.
1:30 PM http://downloads.sourceforge.net/hunspell/en_US.zip?modtime=1188377055&big_mirror=0
1:31 PM me: ok, thanks, I will check that...I will get back to you after that.. if I have any questions...
1:33 PM Németh: Ok,
  you are welcome.



me: Hi, got few minutes? got a doubt in Hunspell
 Németh: Hi, a few minutes...
9:09 PM me: I was trying REP and PHONE in .aff file
  I did not get the difference between the two. The PHONE rules did not work for my Hindi rules. But REP is working
9:10 PM How do PHONE and REP differ?
  both are used for replacing some string with another, right?
9:13 PM Németh: Unfortunately, PHONE works only with ASCII characters, yet!
 me: oh, my content was hindi
 Németh: (Maybe with 8 bit characters, too.)
 me: Unicode
9:14 PM Németh: It was my fault, sorry.
 me: no problem :)
 Németh: In fact, I have just recognised this problem, too.
  Thanks for your question. :)
 me: I am getting required replacement if I use REP
9:15 PM Actually I want replacement only. So how do REP and PHONE differ?
9:17 PM I was trying to port Aspell phonetic rules to Hunspell
9:18 PM Németh: REP makes a single replacement and checks the result. PHONE is the Aspell phonetic algorithm (but it has encoding problems in Hunspell).
9:20 PM Aspell uses an 8-bit mapping for Unicode, so its 8-bit PHONE algorithm works with Hindi.
9:21 PM But I have to rewrite it for Unicode (or use some third-party regex engine).
9:22 PM I have to go, but I will be reachable by e-mail. Bye.
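
The REP behaviour described at 9:18 PM ("a single replacement, then check the result") can be sketched in Python roughly as follows. This is an illustration only, not Hunspell's implementation: `rep_suggestions`, `REP_TABLE` and `DICTIONARY` are hypothetical names, and the ASCII sample data is invented for the example.

```python
# Hypothetical REP-style table: (misspelled fragment, replacement) pairs.
REP_TABLE = [("f", "ph"), ("ph", "f")]

# Hypothetical word list standing in for the .dic file.
DICTIONARY = {"phone", "photograph"}


def rep_suggestions(word, rep_table, dictionary):
    """Try each pair at each matching position, one replacement at a time,
    and keep only the candidates that are found in the dictionary."""
    suggestions = []
    for wrong, right in rep_table:
        if not wrong:
            continue  # an empty pattern would match everywhere
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in dictionary and candidate not in suggestions:
                suggestions.append(candidate)
            start = word.find(wrong, start + 1)
    return suggestions


if __name__ == "__main__":
    # One replacement is enough here.
    print(rep_suggestions("fone", REP_TABLE, DICTIONARY))
    # Two replacements would be needed, so this sketch suggests nothing.
    print(rep_suggestions("fotograf", REP_TABLE, DICTIONARY))
```

Because each candidate differs from the input by exactly one replacement, "fotograf" yields no suggestion in this sketch even though "photograph" is in the word list. Python strings are sequences of Unicode code points, so the same function works unchanged on Hindi text.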