User:Yash

From SMC Wiki
Latest revision as of 17:12, 4 May 2013

Let's get started:9 Apr'13

Hello all!

I am Yash Sinha, currently a student of BITS Pilani, India. Starting my WikiPage today.


Setting Up Repo:9 Apr'13

Today I tried to set up the silpa git repo on my machine. I started early because I knew there could be some difficulties. Initially, I cloned the old repository by mistake, which led to a lot of errors.

Later, I used the current in-progress version of silpa at github.com/Project-SILPA/. I cloned the following modules:

  • [Soundex] (github.com/Project-SILPA/Soundex)
  • [ApproxSearch] (github.com/Project-SILPA/ApproxSearch)
  • [Transliteration] (github.com/Project-SILPA/Transliteration)
  • [Spellchecker] (github.com/Project-SILPA/spellchecker)
  • [Hyphenation] (github.com/Project-SILPA/Hyphenation)
  • [Chardetails] (github.com/Project-SILPA/chardetails)
  • [Payyans] (github.com/Project-SILPA/payyans)

I also installed the following modules: Flask, Jinja2, Werkzeug and Virtualenv. There I made a mistake: instead of Flask I should have used flask. To identify that mistake I almost had to reinstall all my modules.

Initially, I had used sudo python setup.py install to install the cloned modules. This was also an error (I suppose). Later I logged in as root and used python setup.py install.

What I learnt:

  • Python is a case-sensitive language (F/flask). Yeah, it is :)
  • Use the sudo command only when needed.
  • I also learnt to log in as root both via the terminal and the GUI.


Unicode .. Devanagari.. Transliteration …10 Apr'13

Today, I had a hectic but good day.

I tried to learn what transliteration is all about and how it works.

It is done using CMUDict, a pronunciation dictionary. For example, the word BENGAL has this entry:

BENGAL B EH NG AH L

The dictionary maps each word to its sequence of phonemes. Each phoneme is then mapped to its language-specific equivalent, and the result is adjusted to the rules of the target language to form a proper transliterated word.
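As a rough illustration of the first step, here is a minimal Python sketch that splits one CMUDict-style line into a word and its phonemes. The function name parse_cmudict_line is hypothetical, and the sketch assumes plain entry lines only (it does not handle comment lines in the dictionary file).

```python
import re


def parse_cmudict_line(line):
    """Split one CMUDict-style entry into (word, list of phonemes)."""
    word, *phonemes = line.split()
    # Vowels in CMUDict carry a trailing stress digit (e.g. EH1); drop it.
    phonemes = [re.sub(r"\d$", "", p) for p in phonemes]
    # Alternative pronunciations carry a numeric suffix in parentheses.
    word = re.sub(r"\(\d+\)$", "", word)
    return word, phonemes


print(parse_cmudict_line("BENGAL B EH NG AH L"))
```

The resulting phoneme list is what a language-specific map would then be applied to.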


I also made the ‘hindi_english_dict’, ‘hi_vowels’ and ‘hi_vowel_signs’ dictionaries, which are basically sound mappings from hi to en_IN.

I also learnt about various Unicode symbols of Hindi like CANDRABINDU, ANUSVARA, VISARGA, VIRAMA etc. and their counterparts in Kannada, Telugu and Malayalam.
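A quick way to see these signs side by side is Python's standard unicodedata module. The sketch below is only an illustration: the helper name sign_table is made up, and it assumes a Python 3 version recent enough to know all of these character names.

```python
import unicodedata

SCRIPTS = ["DEVANAGARI", "KANNADA", "TELUGU", "MALAYALAM"]
SIGNS = ["CANDRABINDU", "ANUSVARA", "VISARGA", "VIRAMA"]


def sign_table():
    """Map (script, sign) to its code point, looked up by Unicode name."""
    table = {}
    for script in SCRIPTS:
        for sign in SIGNS:
            ch = unicodedata.lookup(f"{script} SIGN {sign}")
            table[(script, sign)] = f"U+{ord(ch):04X}"
    return table


for (script, sign), code_point in sign_table().items():
    print(script, sign, code_point)
```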

Thanks to my friends for helping me here.

These websites also helped me a lot:

  • people.w3.org/rishida/scripts/uniview.fr/chars-devanagari.html
  • en.wiktionary.org/wiki/Appendix:Unicode/Devanagari
  • www.infowebservices.in/hindi/
  • en.wikipedia.org/wiki/Devanagari
  • jrgraphix.net/r/Unicode/0900-097F

There is also a nice list of Devanagari characters in PDF format: www.unicode.org/charts/PDF/U0900.pdf

Finally, सभी को धन्यवाद! (Thank you, everyone!)


Yeah! Now it transliterates:12 Apr'13

Today, I added the following to my copy of the silpa code:

  • File: cmudict.py

_fix_vowel_signs_hi: This function replaces the vowel symbols with corresponding vowel signs. For example: "आ" -> "ा", "ए" -> "े" etc.

  • File: cmumapping.py

CMU_HINDI_MAP: This is basically a dictionary which maps 39 phonemes to their corresponding symbols in Hindi. For example: "AY" -> "ऐ" etc.

  • File: core.py

transliterate_en_hi: To transliterate en to hi, I added this function, which calls CMUDict to transliterate. This is similar to what was written for ml and kn.

  • File: indic_en.py

Improved the 'hindi_english_dict', 'hi_vowels' and 'hi_vowel_signs' dictionaries and added some more mappings to them.
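To make the pieces above concrete, here is a minimal Python sketch of how a phoneme map, a vowel-sign fix and a lookup function could fit together. It is not the code added to silpa: the names PRONUNCIATIONS, PHONEME_TO_HINDI, VOWEL_TO_SIGN and transliterate_en_hi_sketch are hypothetical, the two-word dictionary only stands in for CMUDict, and most of the phoneme mappings are guesses. The rule that a vowel becomes a vowel sign only after a consonant is also an assumption.

```python
# Stand-in for CMUDict: word -> phonemes, stress digits removed (assumed data).
PRONUNCIATIONS = {
    "DOG": ["D", "AO", "G"],
    "BUTTON": ["B", "AH", "T", "AH", "N"],
}

# Hypothetical fragment of a phoneme -> Hindi map; the real CMU_HINDI_MAP
# covers 39 phonemes and may use different symbols.
PHONEME_TO_HINDI = {
    "B": "ब", "D": "ड", "G": "ग", "N": "न", "T": "ट",
    "AH": "अ", "AO": "औ", "AY": "ऐ",
}

# Independent vowel -> dependent vowel sign; "अ" is inherent, so it has no sign.
VOWEL_TO_SIGN = {"अ": "", "आ": "ा", "ए": "े", "ऐ": "ै", "औ": "ौ"}


def is_consonant(symbol):
    """True for a single Devanagari consonant (U+0915 to U+0939)."""
    return len(symbol) == 1 and "\u0915" <= symbol <= "\u0939"


def transliterate_en_hi_sketch(word):
    """Return a Hindi spelling for `word`, or None if it has no entry."""
    phonemes = PRONUNCIATIONS.get(word.upper())
    if phonemes is None:
        return None
    result = ""
    after_consonant = False
    for phoneme in phonemes:
        # Unmapped phonemes are passed through so that gaps stay visible.
        symbol = PHONEME_TO_HINDI.get(phoneme, phoneme)
        if symbol in VOWEL_TO_SIGN and after_consonant:
            # A vowel right after a consonant is written as a vowel sign.
            result += VOWEL_TO_SIGN[symbol]
            after_consonant = False
        else:
            result += symbol
            after_consonant = is_consonant(symbol)
    return result


print(transliterate_en_hi_sketch("dog"))
print(transliterate_en_hi_sketch("button"))
```

Because vowels are converted only after a consonant, a word-initial vowel keeps its independent form in this sketch, and a word missing from the dictionary returns None instead of a partial result.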

Then, I verified the output and this is what I got:

  • print a.transliterate_en_hi('dog') -> "डौग"
  • print a.transliterate_en_hi('button') -> "बटन"
  • print a.transliterate_en_hi('desk') -> "डैसक"

Initially it seemed OK, but later I found some mistakes:

  • print a.transliterate_en_hi('apple') -> "ेपल"
  • print a.transliterate_en_hi('english') -> "िन्गलिष"

Mistakes

  • I had not implemented normalization yet.
  • Some words like raam etc. were not present in the dict.
  • Some phonemes were missing. For example:

print a.transliterate_en_hi('boy') gave the output बाOY. The list of phonemes can be found at www.speech.cs.cmu.edu/cgi-bin/cmudict

  • Any mistakes yet to be found :)
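One way to catch missing phonemes early is to compare a map against the full CMUDict phoneme set. This is a small sketch in Python; the helper missing_phonemes and the one-entry example map are hypothetical, while the 39 symbols are the standard CMUDict phoneme set.

```python
# The 39 phonemes used by CMUDict (without stress markers).
CMU_PHONEMES = (
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH"
).split()


def missing_phonemes(mapping):
    """Return the CMUDict phonemes that have no entry in `mapping`."""
    return [p for p in CMU_PHONEMES if p not in mapping]


# Example with a deliberately incomplete map (only "AY" is defined).
print(missing_phonemes({"AY": "ऐ"}))
```

Running such a check against the full phoneme map would list gaps like OY before they show up in the output.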

Now I understand better how the code works, and I know where to go next.

Areas where my code made a difference:

  • At silpa.org.in/Transliterate "book" transliterated to "ब्क " (bk), whereas my code transliterated it to "बुक" (buk).
  • "because" transliterated to "बिकोस" (bikoss) whereas my code transliterated it to "बिकौज़" (bicauz)

Areas to work on:

  • Add normalization.
  • Add the missing phonemes.
  • Think about how to do it better. One way could be to guess the pronunciation of a word if it is not present in CMUDict.

For an informal version, see: sinhayash.wordpress.com/


Uploaded Application:2 May'13

I installed the modules using Virtualenv and virtualenvwrapper.

My application is at wiki.smc.org.in/User:Yash/Application


CLDR Tools with Babel:4 May'13

Babel can be used to add CLDR tools to silpa.

  • Babel is a collection of tools for internationalizing Python applications. It has a Python interface to the CLDR (Common Locale Data Repository), providing access to various locale display names, localized number and date formatting, etc. [1]
  • It can be integrated with Jinja2. [2]
  • It can enhance the transliteration system in a variety of ways, such as locale display names, translation of country names, and translation of calendar terms (months etc.). [3]

[1] babel.edgewall.org/
[2] jinja.pocoo.org/docs/integration/#babel-integration
[3] babel.edgewall.org/wiki/Documentation/display.html