User:Yash: Difference between revisions

From SMC Wiki



Revision as of 22:42, 12 April 2013

Let's get started: 9 Apr'13

Hello all!

I am Yash Sinha, currently a student at BITS Pilani, India. I am starting my wiki page today.


Setting Up Repo: 9 Apr'13

Today I tried to set up the silpa git repo on my machine. I started early, because I knew there could be some difficulties. Initially, I cloned the old repository by mistake, which led to a lot of errors.

Later, I used the current in-progress version of silpa at (github.com/Project-SILPA/). I cloned the following modules:

  • [Soundex] (github.com/Project-SILPA/Soundex)
  • [ApproxSearch] (github.com/Project-SILPA/ApproxSearch)
  • [Transliteration] (github.com/Project-SILPA/Transliteration)
  • [Spellchecker] (github.com/Project-SILPA/spellchecker)
  • [Hyphenation] (github.com/Project-SILPA/Hyphenation)
  • [Chardetails] (github.com/Project-SILPA/chardetails)
  • [Payyans] (github.com/Project-SILPA/payyans)

I also installed the following modules: Flask, Jinja2, Werkzeug and Virtualenv. I made a mistake there: instead of Flask I should have used flask. To track down that mistake I ended up reinstalling almost all of my modules.

Initially, I had used sudo python setup.py install to install the cloned modules. This was also an error (I suppose). Later I logged in as root and used python setup.py install.

What I learnt:

  • Python is a case-sensitive language (F/flask). Yeah, it is :)
  • Use the sudo command only when needed.
  • I also learnt to log in as root both via the terminal and the GUI.
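
A small check of the case-sensitivity point, assuming the F/flask mix-up happened in a module name. The standard-library json module stands in for flask here so that the snippet runs without Flask installed; can_import is a helper name made up for this sketch.

```python
import importlib


def can_import(name):
    """Return True if a module with exactly this name can be imported."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# Module names must match case exactly.
print(can_import("json"))  # lower-case name, as the module is actually called
print(can_import("JSON"))  # same letters, different case
```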


Unicode .. Devanagari .. Transliteration … 10 Apr'13

Today, I had a hectic but good day.

I tried to learn what transliteration is all about and how it works.

It is done using CMUDict, a pronunciation dictionary that maps English words to sequences of phonemes. For example, for the word BENGAL the dictionary has this entry:

BENGAL B EH NG AH L

Each phoneme in this entry is then mapped to its language-specific equivalent. The result is adjusted according to the rules of the target script to form a proper transliterated word.
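
A minimal sketch of the lookup-then-map idea described above. Both tables are tiny made-up excerpts for illustration (CMUDICT_SAMPLE holds only the entry quoted above, and PHONEME_TO_HI is a hypothetical mapping, not the one in the silpa code); the function names are also assumptions.

```python
# Illustrative excerpt only; the real dictionary is far larger.
CMUDICT_SAMPLE = {"BENGAL": ["B", "EH", "NG", "AH", "L"]}

# Hypothetical phoneme-to-Devanagari table, for illustration only.
PHONEME_TO_HI = {"B": "ब", "EH": "ए", "NG": "ङ", "AH": "अ", "L": "ल"}


def phonemes_for(word):
    """Look up a word's phoneme list; returns None if the word is unknown."""
    return CMUDICT_SAMPLE.get(word.upper())


def map_phonemes(phonemes):
    """Map each phoneme to its symbol; unmapped phonemes pass through as-is."""
    return [PHONEME_TO_HI.get(p, p) for p in phonemes]


print(map_phonemes(phonemes_for("bengal")))
```

The mapped symbols still need script-specific fixing (vowel signs, conjuncts) before they form a proper word, which is why the raw list is returned rather than a joined string.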


I also made the 'hindi_english_dict', 'hi_vowels' and 'hi_vowel_signs' dictionaries, which are basically sound mappings from hi to en_IN.

I also learnt about various Unicode symbols used in Hindi, like CANDRABINDU, ANUSVARA, VISARGA and VIRAMA, and their counterparts in Kannada, Telugu and Malayalam.
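
To make these signs concrete, here is a short sketch that prints their official Unicode names using Python's unicodedata module. It relies on the Indic blocks sharing a parallel layout, so the virama sits at the same offset (0x4D) from the start of each block; the names SCRIPT_BLOCKS and virama_for are made up for this example.

```python
import unicodedata

# Code points of the Devanagari signs mentioned above.
DEVANAGARI_SIGNS = {
    "CANDRABINDU": 0x0901,
    "ANUSVARA": 0x0902,
    "VISARGA": 0x0903,
    "VIRAMA": 0x094D,
}

# Start of each script's Unicode block.
SCRIPT_BLOCKS = {
    "Devanagari": 0x0900,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}


def virama_for(block_start):
    """Return the virama character of the block starting at block_start."""
    return chr(block_start + 0x4D)


for label, cp in DEVANAGARI_SIGNS.items():
    print(label, hex(cp), unicodedata.name(chr(cp)))

for script, start in SCRIPT_BLOCKS.items():
    print(script, unicodedata.name(virama_for(start)))
```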

Thanks to my friends for helping me here.

These websites also helped me a lot:

  • people.w3.org/rishida/scripts/uniview.fr/chars-devanagari.html
  • en.wiktionary.org/wiki/Appendix:Unicode/Devanagari
  • www.infowebservices.in/hindi/
  • en.wikipedia.org/wiki/Devanagari
  • jrgraphix.net/r/Unicode/0900-097F

There is also a nice chart of Devanagari characters in PDF format: www.unicode.org/charts/PDF/U0900.pdf

Finally, सभी को धन्यवाद!


Yeah! Now it transliterates: 12 Apr'13

Today, I added the following to my copy of the silpa code:

  • File: cmudict.py

_fix_vowel_signs_hi: This function replaces the vowel symbols with corresponding vowel signs. For example: "आ" -> "ा", "ए" -> "े" etc.
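
A rough sketch of what such a replacement could look like; this is not the actual _fix_vowel_signs_hi from my copy of the code. It assumes that an independent vowel becomes a vowel sign only when it directly follows a consonant, and that "अ" is dropped there because it is the consonant's inherent vowel. The names HI_VOWEL_TO_SIGN, is_consonant and fix_vowel_signs are made up for this example.

```python
# Independent vowel -> dependent vowel sign (matra).
HI_VOWEL_TO_SIGN = {
    "अ": "",   # inherent vowel: no sign is written after a consonant
    "आ": "ा",
    "इ": "ि",
    "ई": "ी",
    "उ": "ु",
    "ऊ": "ू",
    "ए": "े",
    "ऐ": "ै",
    "ओ": "ो",
    "औ": "ौ",
}


def is_consonant(ch):
    """True for the basic Devanagari consonants KA (U+0915) to HA (U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def fix_vowel_signs(text):
    """Replace vowels that follow a consonant with their vowel signs."""
    out = []
    prev_is_consonant = False
    for ch in text:
        if ch in HI_VOWEL_TO_SIGN and prev_is_consonant:
            out.append(HI_VOWEL_TO_SIGN[ch])
            prev_is_consonant = False
        else:
            # Word-initial vowels (and everything else) are kept unchanged.
            out.append(ch)
            prev_is_consonant = is_consonant(ch)
    return "".join(out)


print(fix_vowel_signs("बउक"))
```

Keeping word-initial vowels in their independent form is a design choice of this sketch; a vowel sign has no consonant to attach to at the start of a word.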

  • File: cmumapping.py

CMU_HINDI_MAP: This is basically a dictionary which maps the 39 CMUDict phonemes to their corresponding symbols in Hindi. For example: "AY" -> "ऐ" etc.

  • File: core.py

transliterate_en_hi: To transliterate from en to hi, I added this function, which calls the CMU dict code to do the transliteration. It is similar to what was already written for ml and kn.

  • File: indic_en.py

Improved 'hindi_english_dict', 'hi_vowels', 'hi_vowel_signs' dictionaries and added some more mappings to them.

Then, I verified the output and this is what I got:

  • print a.transliterate_en_hi('dog') -> "डौग"
  • print a.transliterate_en_hi('button') -> "बटन"
  • print a.transliterate_en_hi('desk') -> "डैसक"

Initially, it seemed OK, but later I found some mistakes:

  • print a.transliterate_en_hi('apple') -> "ेपल"
  • print a.transliterate_en_hi('english') -> "िन्गलिष"

Mistakes

  • I had not implemented normalization yet.
  • Some words, such as raam, were not present in the dictionary.
  • Some phonemes were missing from the mapping. For example, print a.transliterate_en_hi('boy') gave the output बाOY. The list of phonemes is available at www.speech.cs.cmu.edu/cgi-bin/cmudict
  • Any mistakes I have yet to find :)
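
One way to catch missing phonemes before they leak into the output is to compare the mapping's keys against the full set of 39 CMUDict phonemes. This is only a sketch: PARTIAL_MAP is a made-up stand-in for the real mapping, and missing_phonemes is a hypothetical helper.

```python
# The 39 phonemes used by CMUDict (stress digits removed).
CMU_PHONEMES = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY", "IH",
    "IY", "OW", "OY", "UH", "UW",
    "B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M", "N",
    "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y", "Z", "ZH",
}


def missing_phonemes(mapping):
    """Return the phonemes that have no entry in the mapping, sorted."""
    return sorted(CMU_PHONEMES - set(mapping))


# Made-up partial mapping, for illustration only.
PARTIAL_MAP = {"B": "ब", "AY": "ऐ", "K": "क"}

print(len(CMU_PHONEMES))
print("OY" in missing_phonemes(PARTIAL_MAP))
```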

Now I have a better understanding of how the code works, and I also know where to go next.

Cases where my code's output differed from the live site:

  • At silpa.org.in/Transliterate "book" transliterated to "ब्क " (bk), whereas my code transliterated it to "बुक" (buk).
  • "because" transliterated to "बिकोस" (bikoss) whereas my code transliterated it to "बिकौज़" (bicauz)

Areas to work on:

  • Add normalization.
  • Add the missing phonemes.
  • Think about how to do it better. One way could be to guess the pronunciation of a word if it is not present in CMU Dict.
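
A very rough sketch of the last idea: look the word up first and, only if it is missing, guess phonemes from the spelling with a greedy longest-match over a small rule table. LETTER_RULES is a tiny made-up table (real letter-to-sound rules would need to be much richer), and guess_phonemes and pronounce are hypothetical names.

```python
# Made-up spelling-to-phoneme rules, for illustration only.
LETTER_RULES = {
    "AA": ["AA"],
    "A": ["AH"],
    "R": ["R"],
    "M": ["M"],
    "SH": ["SH"],
    "S": ["S"],
    "H": ["HH"],
}


def guess_phonemes(word, rules=LETTER_RULES):
    """Guess phonemes from spelling, preferring the longest matching chunk."""
    word = word.upper()
    max_len = max(len(key) for key in rules)
    phonemes = []
    i = 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.extend(rules[chunk])
                i += size
                break
        else:
            # No rule matched: keep the letter so the gap stays visible.
            phonemes.append(word[i])
            i += 1
    return phonemes


def pronounce(word, dictionary):
    """Use the dictionary entry if there is one, otherwise guess."""
    entry = dictionary.get(word.upper())
    return entry if entry is not None else guess_phonemes(word)


print(pronounce("raam", {}))
```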

For an informal version, see: sinhayash.wordpress.com/