CDAC-IDN-Critique: Difference between revisions

From SMC Wiki
(Added TTA forms - Syam Krishnan)

Revision as of 14:20, 2 December 2010

Malayalam IDN Policy Draft by CDAC - Critique by SMC

The Draft policy document by CDAC is available here

Introduction

Issues with the approach and process

Criticism on the policy

General Comments

  1. The variant table is built from seemingly arbitrary glyphs picked from a list of 900+ possible glyphs for Malayalam. No explanation is given of how two entries in the variant table come to be treated as homographs. One entry is in the variant table only because one glyph is the mirror image of the other. Since b and d are not excluded in English, there is no need to exclude mirror-image glyphs in the variant table.
    CDAC: The IDN system devised for Malayalam is based only on the modern script. It doesn't address the old script or the fonts based on the old script. Also, a detailed study was done before proposing homographs in each of the languages. The study included observing the visual form of the conjunct at the point size of the address bars of major browsers. The mirror-image nature of the glyphs was not the criterion for the two glyphs to qualify as variants.
  2. Visually identical glyphs are not the only entries to be considered for the variant table. The Unicode chart itself has ambiguous dual representations of the same sign that are not canonically equivalent. The au signs in Tamil and Malayalam are an example: ௗ - ௌ and ൗ - ൌ . The document does not consider these special cases.
    CDAC: The IDN policy does not permit the entry of syllables having the structure CMM or MCM, where M stands for Matra or vowel sign. The ABNF rules take care of this.
  3. Many glyphs in Malayalam have different orthographic forms. The variant table does not address the different scenarios that arise when visual similarity is considered. For example, in traditional orthography TTA is written in stacked form (റ്റ), while in modern orthography it can be written in non-stacked form, and this non-stacked form is visually identical to a sequence of two RA (ററ).
    CDAC: Only the stacked form is considered to be the conjunct TTA in modern orthography.
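The encoding side of items 2 and 3 can be inspected with the Unicode Character Database. The following is a minimal sketch using only Python's standard unicodedata module; the helper names describe and canonically_equivalent are illustrative and do not come from the draft. Note that Unicode names U+0D31 "RRA", where this page calls it RA.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Item 2: the two Malayalam au signs.
AU_LENGTH_MARK = "\u0D57"
AU_VOWEL_SIGN = "\u0D4C"

# Item 3: stacked TTA (letter + virama + letter) versus the same letter twice.
TTA_CONJUNCT = "\u0D31\u0D4D\u0D31"
DOUBLE_LETTER = "\u0D31\u0D31"

if __name__ == "__main__":
    for sample in (AU_LENGTH_MARK, AU_VOWEL_SIGN, TTA_CONJUNCT, DOUBLE_LETTER):
        print(describe(sample))
    print(canonically_equivalent(AU_LENGTH_MARK, AU_VOWEL_SIGN))
    print(canonically_equivalent(TTA_CONJUNCT, DOUBLE_LETTER))
```

Neither pair is unified by normalization, so any equivalence between them would have to be declared explicitly, for instance in the variant table.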

ABNF rules

  1. Section 2 gives ക് as the pure consonant of ക. However, it is the chillu form ക്‍ that is considered the pure consonant of ka.
  2. Section 2.a says CM can be followed only by D (anuswara) or X (visarga). This excludes the samvruthokarams of Malayalam. Any consonant can take the u vowel sign followed by a virama, forming the samvruthokaram form of that consonant. Examples: തു് , കു് , പു് , രു് .
  3. Section 3.a restricts the number of consonants in a syllable to 4, but ഗ്ദ്ധ്ര്യ has 5 consonants.
  4. Section 3.b excludes syllables with a samvruthokaram, such as ക്കു് .
  5. Section 4 states that a chillu can be followed by a vowel sign. Since a chillu is a dead consonant, there is no possibility of having a virama after a chillu.
  6. The example used for LHC - ന്‍്റ does not exist in print or in digital form. No input method or Malayalam writer produces ന്റ in this way. The sequence for nta is ന + ് + റ , i.e. there is no LHC sequence in Malayalam.
  7. Since LHC is invalid for Malayalam, including for L = ന്‍ , section 5 of the document cancels itself.
  8. Because of argument #6, section 6 also cancels itself.
  9. Because of arguments #1 to #8, the IDN rule "Consonant Sequence → *3(CH) C [H|D|X|M[D|X]] | L[HC[D|H|M[D]]]" is completely wrong and needs to be reformulated.
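Arguments #2 to #4 can be reproduced mechanically. The sketch below translates only the first alternative of the quoted rule, *3(CH) C [H|D|X|M[D|X]], into a Python regular expression. The character classes for C, H, D, X and M are assumptions based on the Unicode Malayalam block; the draft's own sets may differ, and the name is_valid_syllable is illustrative.

```python
import re

# Assumed character classes (not taken from the draft).
C = "[\u0D15-\u0D39]"        # consonants KA..HA
H = "\u0D4D"                 # virama (chandrakkala)
D = "\u0D02"                 # anuswara
X = "\u0D03"                 # visarga
M = "[\u0D3E-\u0D4C\u0D57]"  # vowel signs (matras), including the au length mark

# First alternative of the rule: *3(CH) C [H|D|X|M[D|X]]
CONSONANT_SEQUENCE = re.compile(
    f"(?:{C}{H}){{0,3}}{C}(?:{H}|{D}|{X}|{M}(?:{D}|{X})?)?"
)

def is_valid_syllable(syllable):
    """True if the whole string matches the first alternative of the rule."""
    return CONSONANT_SEQUENCE.fullmatch(syllable) is not None

KKA = "\u0D15\u0D4D\u0D15"                     # KA + virama + KA
KKU = KKA + "\u0D41"                           # ... + vowel sign U
SAMVRUTHOKARAM_KU = "\u0D15\u0D41\u0D4D"       # KA + vowel sign U + virama
SAMVRUTHOKARAM_KKU = KKU + "\u0D4D"            # the example from argument #4
# GA, DA, DHA, RA, YA joined by viramas: the 5-consonant example from argument #3
FIVE_CONSONANTS = "\u0D17\u0D4D\u0D26\u0D4D\u0D27\u0D4D\u0D30\u0D4D\u0D2F"

if __name__ == "__main__":
    for s in (KKA, KKU, SAMVRUTHOKARAM_KU, SAMVRUTHOKARAM_KKU, FIVE_CONSONANTS):
        print([f"U+{ord(ch):04X}" for ch in s], is_valid_syllable(s))
```

Under these assumed classes the rule accepts ordinary conjuncts but rejects both samvruthokaram examples and the five-consonant conjunct, which is the behaviour the arguments above object to.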


Restriction Rules

  1. Section 2 says "H is not permitted after V, D, X, M, digit and dash". This is wrong, since a samvruthokaram requires H after a vowel sign (M).
  2. Section 7 says H can follow L if it is followed by റ . This is wrong, as explained above. L can never be followed by H. It can only be followed by C.


nta criticism

The draft does not address the case of the stacked and non-stacked forms of nta, which are used interchangeably. For example, എന്റെ can be spoofed with എന്‍റെ. The severity of this issue is increased by Unicode 5.1, which introduced one more sequence to represent the same conjunct (ന്‍ + ് + റ).
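The competing encodings can be compared directly. This is a minimal sketch, assuming that the spoofing example above is encoded with a ZWJ between the virama and the following letter and that the Unicode 5.1 sequence starts with the atomic chillu U+0D7B; the constant and function names are illustrative.

```python
import unicodedata

# Three candidate encodings of the same word (assumed sequences, see lead-in).
STACKED_NTA = "\u0D0E\u0D28\u0D4D\u0D31\u0D46"        # E, NA, virama, RRA, vowel sign E
ZWJ_NTA = "\u0D0E\u0D28\u0D4D\u200D\u0D31\u0D46"      # E, NA, virama, ZWJ, RRA, vowel sign E
ATOMIC_NTA = "\u0D0E\u0D7B\u0D4D\u0D31\u0D46"         # E, CHILLU N, virama, RRA, vowel sign E

def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

def distinct_after_normalization(labels, form="NFC"):
    """True if no two labels collapse to the same string under the given form."""
    normalized = {unicodedata.normalize(form, label) for label in labels}
    return len(normalized) == len(labels)

if __name__ == "__main__":
    labels = [STACKED_NTA, ZWJ_NTA, ATOMIC_NTA]
    for label in labels:
        print(code_points(label))
    print(distinct_after_normalization(labels, "NFC"))
    print(distinct_after_normalization(labels, "NFKC"))
```

Because Unicode normalization keeps all three sequences distinct, a registry that wants to treat them as one label has to handle them itself.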

Chart of allowed characters

  1. Malayalam chillus - the Unicode 5.1 chillu ക്‍ is removed from the tables, although it has the same characteristics and use cases as the other chillus. Excluding it from the allowed code points therefore does not make any sense. Moreover, the existing non-atomic chillu representation is not mentioned in the document at all.
  2. Malayalam au sign - ൌ is not allowed. Instead, the au length mark ൗ is provided. The InScript standard does not allow one to type ൗ and allows only ൌ. Other input methods allow both to be typed. But the document says nothing about the equivalence of the two. Allowing both vowel signs would also be a spoofing issue, so this should be handled in the variant table.

Variant Table and Visual Spoofing

The variant table is not logical. Only the pair ളള and ള്ള makes sense. None of the other entries should be considered spoofing. ന്ത and ന്ന are not even close. Mirror images are already used in Latin, e.g. b and d, hence സ്സ and ഡ്ഡ cannot be blocked. Moreover, it is not clear why the same logic is not applied to സ and ഡ. The table also does not consider the case of ററ and the non-stacked form of റ്റ common in new lipi.

Even though similarity is considered, dual encoding is not mentioned. With the dual encoding of chillus, both forms of a chillu (the atomic chillu and consonant + chandrakkala + ZWJ) look the same.
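One way a registry could handle the dual encoding is to fold both chillu forms to a single form before comparing labels. The following is a minimal sketch of that idea, not something proposed in the draft. The mapping from base consonant to atomic chillu is an assumption based on the Unicode 5.1 chillu code points, and fold_chillus is a hypothetical helper name.

```python
ZWJ = "\u200D"
VIRAMA = "\u0D4D"

# Assumed mapping: base consonant -> atomic chillu (U+0D7A..U+0D7F).
ATOMIC_CHILLU = {
    "\u0D23": "\u0D7A",  # NNA -> CHILLU NN
    "\u0D28": "\u0D7B",  # NA  -> CHILLU N
    "\u0D30": "\u0D7C",  # RA  -> CHILLU RR
    "\u0D32": "\u0D7D",  # LA  -> CHILLU L
    "\u0D33": "\u0D7E",  # LLA -> CHILLU LL
    "\u0D15": "\u0D7F",  # KA  -> CHILLU K
}

def fold_chillus(label):
    """Replace every consonant + virama + ZWJ sequence with its atomic chillu."""
    for base, atomic in ATOMIC_CHILLU.items():
        label = label.replace(base + VIRAMA + ZWJ, atomic)
    return label

def same_label(a, b):
    """Compare two labels after folding the two chillu encodings together."""
    return fold_chillus(a) == fold_chillus(b)

# The same three-letter word ending in chillu N, in both encodings.
WORD_ZWJ_FORM = "\u0D05\u0D35\u0D28\u0D4D\u200D"
WORD_ATOMIC_FORM = "\u0D05\u0D35\u0D7B"

if __name__ == "__main__":
    print(WORD_ZWJ_FORM == WORD_ATOMIC_FORM)            # raw comparison
    print(same_label(WORD_ZWJ_FORM, WORD_ATOMIC_FORM))  # comparison after folding
```

Folding toward the atomic form is an arbitrary choice here; folding in the other direction, or listing both forms as variants, would serve the same purpose.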

Conclusion