Hyphenation

Revision as of 09:30, 24 August 2009

This page explains the hyphenation of Indian languages in OpenOffice and on web pages.

What is Hyphenation

Hyphenation is the process of inserting hyphens between the syllables of a word so that, when the text is justified, the available space is used as fully as possible.

Hyphenation is an important feature of DTP software. No good DTP software is available for Indian languages. XeTeX is the only choice for working with Unicode and producing professional-quality page layout, but XeTeX is not quite the same thing as a DTP package. Inkscape can serve as a temporary solution, but only for small-scale work. A project is under way to add a HarfBuzz back end to Scribus, the free-software DTP package.

Hyphenation is also needed in many other places. It is required wherever we justify a block of text, whether in OpenOffice, in any other word processor, or on a web page.

OpenOffice can load hyphenation dictionaries just as it loads spell-checking dictionaries, but for Indian languages we have yet to prepare hyphenation dictionaries (more on that later). The W3C CSS3 draft has a provision for hyphenation, but it is still at the draft stage.

Algorithm For Hyphenation

The basis of all hyphenation algorithms is the one designed by Frank Liang in 1983 and adopted in TeX. The Wikipedia article on TeX explains it with a very simple example:

If TeX must find the acceptable hyphenation positions in the word encyclopedia, for example, it will consider all the subwords of the extended word .encyclopedia., where . is a special marker to indicate the beginning or end of the word. The list of subwords includes all the subwords of length 1 (., e, n, c, y, etc.), of length 2 (.e, en, nc, etc.), and so on, up to the subword of length 14, which is the word itself, including the markers. TeX will then look into its list of hyphenation patterns, and find subwords for which it has calculated the desirability of hyphenation at each position. In the case of our word, 11 such patterns can be matched, namely 1c4l4, 1cy, 1d4i3a, 4edi, e3dia, 2i1a, ope5d, 2p2ed, 3pedi, pedia4, y1c. For each position in the word, TeX will calculate the maximum value obtained among all matching patterns, yielding en1cy1c4l4o3p4e5d4i3a4. Finally, the acceptable positions are those indicated by an odd number, yielding the acceptable hyphenations en-cy-clo-pe-di-a. This system based on subwords allows the definition of very general patterns (such as 2i1a), with low indicative numbers (either odd or even), which can then be superseded by more specific patterns (such as 1d4i3a) if necessary. These patterns find about 90% of the hyphens in the original dictionary; more importantly, they do not insert any spurious hyphen. In addition, a list of exceptions (words for which the patterns do not predict the correct hyphenation) is included with the Plain TeX format; additional ones can be specified by the user.
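The lookup described in this example can be sketched in a few lines of Python. This is only an illustration, not TeX's or OpenOffice's actual code: the function names are made up for this page, the pattern list holds just the 11 patterns quoted above, and the minimum-prefix and minimum-suffix limits that real implementations apply are left out.

```python
# The 11 patterns quoted in the example above (not a complete pattern file).
PATTERNS = ["1c4l4", "1cy", "1d4i3a", "4edi", "e3dia", "2i1a",
            "ope5d", "2p2ed", "3pedi", "pedia4", "y1c"]


def parse_pattern(pattern):
    """Split a pattern such as '1d4i3a' into its letters and gap values.

    values[i] is the number written before letters[i]; the last entry is
    the number written after the final letter. Missing numbers count as 0.
    """
    letters, values = "", [0]
    for ch in pattern:
        if ch in "0123456789":
            values[-1] = int(ch)
        else:
            letters += ch
            values.append(0)
    return letters, values


def hyphenation_points(word, patterns):
    """Return the maximum pattern value for every gap of '.word.'."""
    extended = "." + word + "."
    points = [0] * (len(extended) + 1)  # points[i] is the gap before extended[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = extended.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                points[start + offset] = max(points[start + offset], value)
            start = extended.find(letters, start + 1)
    return points


def hyphenate(word, patterns):
    """Split the word at every gap whose maximum value is odd."""
    points = hyphenation_points(word, patterns)
    pieces, current = [], ""
    for index, ch in enumerate(word):
        # word[index] is extended[index + 1] because of the leading marker.
        if index > 0 and points[index + 1] % 2 == 1:
            pieces.append(current)
            current = ""
        current += ch
    pieces.append(current)
    return pieces


print("-".join(hyphenate("encyclopedia", PATTERNS)))  # en-cy-clo-pe-di-a
```

Taking the maximum at each gap is what lets a specific pattern such as ope5d override a more general one such as e3dia at the same position.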

For more details about the algorithm used in OpenOffice, read this paper by Nemeth Laszlo: http://markmail.org/download.xqy?id=rwne7kf67ttyk62l&number=2

Hyphenation in Indian languages.

Unlike in English and other languages, hyphenation in Indian languages is not very complex. In general, the rules are as follows:

[consonant][vowel][consonant] can be hyphenated as [consonant][vowel] - [consonant], provided the vowel is not a virama (halant)
Don't split a word after ZWJ
A word can be split after ZWNJ
Any language-specific rules also apply. For example, in ml_IN a line should not start with a chillu letter.
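The general rules above can be sketched in Python as follows. This is a minimal illustration for Devanagari only, and it rests on several assumptions: the function names are hypothetical, the character ranges are the Unicode Devanagari consonants (U+0915 to U+0939) and dependent vowel signs (U+093E to U+094C), and language-specific rules such as the ml_IN chillu restriction are not covered.

```python
ZWJ = "\u200d"     # zero width joiner
ZWNJ = "\u200c"    # zero width non-joiner
VIRAMA = "\u094d"  # Devanagari virama (halant)


def is_consonant(ch):
    """True for Devanagari consonants, U+0915 (ka) to U+0939 (ha)."""
    return "\u0915" <= ch <= "\u0939"


def is_vowel_sign(ch):
    """True for Devanagari dependent vowel signs, U+093E to U+094C."""
    return "\u093e" <= ch <= "\u094c"


def break_positions(word):
    """Return the indexes i at which the word may be split before word[i]."""
    positions = []
    for i in range(1, len(word)):
        previous, following = word[i - 1], word[i]
        if previous == ZWJ or previous == VIRAMA:
            # Never split after ZWJ; a virama is not a vowel, so no split either.
            continue
        if previous == ZWNJ:
            # A split is allowed after ZWNJ.
            positions.append(i)
            continue
        # [consonant][vowel] - [consonant]
        if (i >= 2 and is_consonant(word[i - 2])
                and is_vowel_sign(previous) and is_consonant(following)):
            positions.append(i)
    return positions


# na + vowel sign u + pa: one split is allowed, before pa.
print(break_positions("\u0928\u0941\u092a"))  # [2]
```

The sketch applies the listed rules literally. The pattern dictionaries described in the next section go further and also allow a split before a consonant that follows a bare consonant.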

Hyphenation Dictionaries for Indian languages.

Based on the rules above, let us try to create hyphenation dictionaries for Indian languages. I will explain this with the help of an example Hindi word: अनुपल्ब्ध. We have to define the following patterns in the dictionary for it:

अ1 -> 1 is an odd number, i.e. the word can be hyphenated after अ

ु1 -> 1 is an odd number, i.e. the word can be hyphenated after ु

1ल -> 1 is an odd number, i.e. the word can be hyphenated before ल

1प -> 1 is an odd number, i.e. the word can be hyphenated before प

1ब -> 1 is an odd number, i.e. the word can be hyphenated before ब

्2 -> 2 is an even number, i.e. the word can NOT be hyphenated after ्

1ध -> 1 is an odd number, i.e. the word can be hyphenated before ध

Where two patterns cover the same position, the higher number wins, so ्2 overrides 1ब and 1ध after each virama. The end result is अ+नु+प+ल्ब्ध, where + marks a permitted hyphenation point.
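As a check, the seven patterns can be applied to the example word with a short Python sketch. The function names are made up for this page, and the word-boundary marker used by real pattern files is omitted because none of these patterns uses it. The two combining signs are written as escapes so that they stay readable in the source.

```python
U_SIGN = "\u0941"  # Devanagari vowel sign u
VIRAMA = "\u094d"  # Devanagari virama (halant)

# The seven patterns listed above, in the same order.
PATTERNS = ["अ1", U_SIGN + "1", "1ल", "1प", "1ब", VIRAMA + "2", "1ध"]
WORD = "अनुपल्ब्ध"


def split_pattern(pattern):
    """Split a pattern such as '1ल' into its letters and gap values."""
    letters, values = "", [0]
    for ch in pattern:
        if ch in "0123456789":
            values[-1] = int(ch)
        else:
            letters += ch
            values.append(0)
    return letters, values


def split_word(word, patterns):
    """Split the word wherever the highest matching pattern value is odd."""
    points = [0] * (len(word) + 1)  # points[i] is the gap before word[i]
    for pattern in patterns:
        letters, values = split_pattern(pattern)
        start = word.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                points[start + offset] = max(points[start + offset], value)
            start = word.find(letters, start + 1)
    pieces, current = [], ""
    for index, ch in enumerate(word):
        if index > 0 and points[index] % 2 == 1:
            pieces.append(current)
            current = ""
        current += ch
    pieces.append(current)
    return pieces


print("+".join(split_word(WORD, PATTERNS)))  # अ+नु+प+ल्ब्ध
```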

Hyphenation in OpenOffice

Manual Installation

How to install an xx_IN hyphenation dictionary:

Get the hyphenation dictionary from http://download.savannah.gnu.org/releases-noredirect/smc/hyphenation/patterns-0.5/
Copy the hyphenation dictionary hyph_xx_IN to the /usr/share/myspell/dicts folder.
In the /usr/share/myspell/infos/ooo/ folder, create a file named openoffice.org-hyphenation-xx containing the single line: HYPH xx IN hyph_xx_IN
Run the command: sudo update-openoffice-dicts
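The manual steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested installer: the function name and its root and run_update parameters are hypothetical, the dictionary is assumed to be already downloaded, and the paths, the info file name and the HYPH line are taken from the steps above. Writing to the real /usr/share folders needs root privileges.

```python
import shutil
import subprocess
from pathlib import Path


def install_hyphenation_dictionary(dictionary, lang, root="/", run_update=True):
    """Copy a downloaded dictionary into place and register it.

    dictionary: path to the downloaded file, for example hyph_ml_IN
    lang:       two-letter language code, for example "ml"
    root:       filesystem root; change it to try the steps in a scratch folder
    run_update: whether to call update-openoffice-dicts afterwards
    """
    root = Path(root)
    dicts_dir = root / "usr/share/myspell/dicts"
    info_dir = root / "usr/share/myspell/infos/ooo"
    dicts_dir.mkdir(parents=True, exist_ok=True)
    info_dir.mkdir(parents=True, exist_ok=True)

    # Copy the dictionary file into the dicts folder.
    shutil.copy(dictionary, dicts_dir / Path(dictionary).name)

    # Create the one-line info file.
    info_file = info_dir / f"openoffice.org-hyphenation-{lang}"
    info_file.write_text(f"HYPH {lang} IN hyph_{lang}_IN\n", encoding="utf-8")

    # Ask the system to refresh the dictionary list.
    if run_update:
        subprocess.run(["update-openoffice-dicts"], check=True)
    return info_file
```

For example, install_hyphenation_dictionary("hyph_ml_IN", "ml"), run as root, would carry out the copy, info-file and update steps for Malayalam.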

Installing in GNU/Linux Distributions

From Fedora 11 onwards, hyphenation patterns for Indian languages are available. You can install them using your package manager. The package name is hyphen-xx, where xx is a language code such as hi, gu, ml, te, ta or bn.

OpenOffice Extensions

Get the OpenOffice extensions for your language from the following links

How to use

Open OpenOffice Writer, then open a file in your language or type some text. Justify the text. Set the language of the selection using the Tools->Language menu, then hyphenate it using the Tools->Language->Hyphenation menu.

Download

The latest version of the OpenOffice hyphenation patterns can be found here: http://download.savannah.gnu.org/releases-noredirect/smc/hyphenation/patterns-0.5/

License

The OpenOffice hyphenation patterns for Indian languages are licensed under the GNU Lesser General Public License, version 3.0 or any later version.

