Improving the cross-language transliteration system (Overview)
This is my proposal for Google Summer of Code 2013: improving the cross-language transliteration system. The project aims to improve the system by adding support for Hindi as an intermediate language. It also aims to use the tools of the CLDR project to improve transliteration. It may be extended to the Kaithi script.
Who are you?
- Name: Yash Sinha
- University: Birla Institute of Technology & Science, Pilani, Rajasthan.
- College: Birla Institute of Technology & Science, Pilani Campus, Pilani.
- Current Education: M.Sc. (Tech.) Information Systems. (2012-2016)
I am Yash, pursuing Information Systems at BITS Pilani. I am from Hazaribag, a small town in Jharkhand, India. I am a geek who does what I desire.
I started learning Java during my school days and, in 2010, built an application called Dhamaal Calculator which, apart from computing numbers, had features like a splash screen, Hindi/Sanskrit language support and a system tray icon. I learnt C and C++ to speed up my programs for coding challenges. In 2012, I qualified for the final round of the ACM-Indian Coding League held at BITS Pilani as part of APOGEE, the technical fest of BITS. After that, I learnt Python, matplotlib (a plotting library) and wxPython (a UI library). I received a certificate from the Massachusetts Institute of Technology for a course at edx.org. Last month, I learnt about OpenGL on edx.org.
I like chemistry and am interested in surface phenomena. I completed a project on adsorbing carbon from the ambient atmosphere by capturing it on the surface of a resin, which was selected for the CBSE National Level Science Exhibition at New Delhi.
I also carry a legacy of Indian Classical Music from my family. I am a tabla player and I have performed at National Level Youth Festivals more than five times. I graduated (Sangeet Prabhakar) in Tabla from Prayag Sangeet Samiti, Allahabad.
- GitHub Username: yash-sinha
- Email: email@example.com
- IRC Handle: #sinhayash
- Blog: sinhayash.wordpress.com
What is your programming experience?
1. What platform do you use to code? What editor do you prefer and why?
I use both Windows and Linux (Ubuntu 12). Java and Python are my favourite languages, but I am also good at C and C++. For C and C++, I use CodeBlocks and Visual Studio; for Python I prefer Ninja/IDLE, and for Java I use NetBeans.
2. How well can you use Malayalam, and how good are your Malayalam reading and typing skills?
I have no experience in Malayalam. I do have friends who are well versed in Malayalam and can help me understand the script if the need arises.
3. Tell us about something you have created.
- ii. In class 11, I created an application called Dhamaal Calculator in Java. It had all the features of a scientific calculator, along with a background changer, a look-and-feel selector, support for Hindi and Sanskrit, a splash screen and a system tray icon.
github.com/yash-sinha/Dhamaal-Calculator Run file: run.jar
- iii. I created a Hangman game in Python in which a player thinks of a word and the other tries to guess it by suggesting letters or numbers.
- iv. I also developed a Word Scrabble game in Python in which the player has to form meaningful words from a given pool of letters and scores points for them. It also has an option to play against the computer.
- v. I made a simulation program in Python, which stochastically determined virus population in a patient’s body and plotted graphs using the data obtained.
- vi. Currently, as part of my summer project, I am building a website for a protein filter. It searches a pool of experimentally determined protein structures for a particular protein sequence and filters the results by release date. It is nearly complete.
4. What makes you excited about SMC? Have you worked before with SMC or another open source project as a contributor? If yes, when and on what?
I have not worked with any other open source project before. I have not contributed to SMC formally, but I have made some minor contributions and solved the issue raised for GSoC beginners. Since my school days, I have been enthusiastic about non-English language support in computer applications. This led me to add Hindi and Sanskrit support to Dhamaal Calculator. SMC's goal of promoting Indic languages and, ultimately, producing a language module for the Python community appeals to me. This is the main reason why I would like to work with SMC. I would like to learn from and contribute to SMC even if I do not get selected for GSoC.
5. Have you ever used git or any other version control system?
I had not used git before. After the announcement of GSoC, I used it while setting up the silpa repository, and I am learning it quickly.
1. Did you participate in past GSoC programs? If so, which years and which organizations?
No, this is my first attempt.
2. Do you have other obligations between May and August? Please note that we expect the Summer of Code to be a full time, 40 hour a week commitment.
No, I have no other obligations between May and August. I plan to work 6 to 7 hours every day, six days a week, and take Sundays off (up to 7 × 6 = 42 hours a week).
3. Will you continue contributing to and supporting Swathanthra Malayalam Computing after the GSoC 2013 program? If yes, which area(s) are you interested in?
Yes, I would like to keep contributing to the transliteration system.
4. Have you communicated with a potential mentor? If so, who?
I have communicated with Vasudev Kamath (copyninja on IRC).
5. SMC Wiki link of your proposal
Why should we choose you over other applicants?
I have been enthusiastic about the silpa project on IRC right from the beginning. I started discussing and learning about it from the very first hour after Google announced SMC as a mentoring organization.
I have gained a fair knowledge of the silpa source code and I love to code in Python. I have already cloned the repository (with some initial hiccups) and even tried adding Hindi dictionaries and normalizations, the details of which I have posted on the wiki (wiki.smc.org.in/User:Yash#Yeah.21_now_it_transliterates:12_Apr.2713) and on my blog. The code is available at github.com/yash-sinha/Transliteration. I also solved the issue raised for GSoC beginners.
I look forward to working on this project, which would be a major achievement for me, and to ultimately benefiting the Python community.
What is your project?
I want to improve the cross-language transliteration system by adding support for Hindi as an intermediate language. I would also try to use the tools of the CLDR project to improve the transliteration system. It may be extended to the Kaithi script.
As has been done for Malayalam and Kannada, I will implement the transliteration functions for Hindi. I will include features of the Hindi script such as Chandrabindu, Chandra, Anusvara, Visarga, Rra, Llla, Udatta, Anudatta, Danda and Za. I will also add normalizations to cover sounds like au (as in the English word "awe"), da (as in the Hindi word "pahaad") and Om. At present, the project lacks all of these.
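As a rough illustration of the dictionary-driven approach described above, here is a minimal sketch in Python. The table names (VOWELS, CONSONANTS, VOWEL_SIGNS, OTHER_SIGNS), the transliterate function and the Latin spellings are all hypothetical and chosen only for illustration; they are not silpa's actual API or mapping tables, and only a handful of letters are covered.

```python
# Hypothetical, deliberately tiny mapping tables (NOT silpa's real tables).
# Keys are Devanagari code points; values are illustrative Latin spellings.
VOWELS = {
    "\u0905": "a",   # independent vowel A
    "\u0906": "aa",  # independent vowel AA
    "\u0907": "i",   # independent vowel I
    "\u0913": "o",   # independent vowel O
}
CONSONANTS = {
    "\u0915": "k",  # KA
    "\u0924": "t",  # TA
    "\u0928": "n",  # NA
    "\u092e": "m",  # MA
    "\u0930": "r",  # RA
    "\u0932": "l",  # LA
    "\u0938": "s",  # SA
}
VOWEL_SIGNS = {
    "\u093e": "aa",  # vowel sign AA
    "\u093f": "i",   # vowel sign I
    "\u0947": "e",   # vowel sign E
    "\u094b": "o",   # vowel sign O
}
OTHER_SIGNS = {
    "\u0901": "n",  # chandrabindu
    "\u0902": "m",  # anusvara
    "\u0903": "h",  # visarga
    "\u0964": ".",  # danda
}
VIRAMA = "\u094d"  # suppresses the consonant's inherent vowel


def transliterate(text):
    """Map Devanagari text to Latin, one code point at a time."""
    out = []
    pending_a = False  # True while the last consonant still owes its inherent 'a'
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            # A vowel sign replaces the inherent vowel.
            out.append(VOWEL_SIGNS[ch])
            pending_a = False
        elif ch == VIRAMA:
            # The virama cancels the inherent vowel.
            pending_a = False
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            if ch in VOWELS:
                out.append(VOWELS[ch])
            elif ch in OTHER_SIGNS:
                out.append(OTHER_SIGNS[ch])
            else:
                out.append(ch)  # pass unknown characters through unchanged
    if pending_a:
        out.append("a")
    return "".join(out)
```

This sketch always writes the inherent vowel, so it does not model the schwa deletion of spoken Hindi; a real implementation would need extra rules or normalizations for that.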
I will also work out how to use the CLDR tools to improve the transliteration system. In addition, I will try to incorporate the Levenshtein edit distance algorithm to improve the quality of transliteration.
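For reference, the Levenshtein edit distance mentioned above can be computed with the standard dynamic-programming recurrence, sketched below. The proposal does not yet fix how the distance will be used inside the transliteration system, so the closest helper is only one hypothetical use (picking the nearest known word to a transliteration candidate); both function names are illustrative.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest(word, candidates):
    """Hypothetical helper: pick the candidate with the smallest distance."""
    return min(candidates, key=lambda c: levenshtein(word, c))
```

Because Python strings are sequences of Unicode code points, the same function works on Devanagari or Malayalam text, although it then counts each combining sign as a separate character.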
If time permits, I will try to add transliteration to the Kaithi script too.
The need you believe it fulfills
- Adding Hindi as an intermediate language will improve the transliterations on silpa. Many Hindi syllables come out distorted when they are transliterated via Malayalam or Kannada. Sounds like au ("awe") and symbols like the danda are missing. I will try to improve the system on these points.
- It will help with transliteration to Kaithi, because Kaithi is similar to Hindi. I plan to implement this if time permits.
- Transliteration in Hindi will open the door to transliteration in many regional languages such as Bhojpuri, Maithili and Magahi.
Before May 3 (Application deadline):
- Set up the required environment at my workplace. ✓
- Join the mailing list. ✓
- Set up my blog and wiki page. ✓
- Know the community and its working style.
- Familiarize myself with the code of the project, the documentation and test system used.
May 3 – May 27:
- Learn the git version control system.
- Try adding Hindi dictionaries and Hindi transliteration functions to the module to deepen my understanding of the code.
May 28 – June 17 (Before the official coding time):
- Get familiar with the Python packages Flask, Jinja2, Werkzeug and virtualenv.
- Use this time to discuss and finalize any changes to the existing set of deliverables.
- Stay in touch with my mentor and the community through IRC and the mailing lists, and become absolutely clear about the desired final results.
June 18 – July 2 (Official coding period starts):
- Add Hindi as an intermediate language for transliteration.
- Add dictionaries for vowels, consonants and vowel symbols.
- Add normalizations.
- Finalize how to use the CLDR tools to improve the transliteration system.
- Finalize the transliteration aspects of Kaithi.
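One possible reading of the "normalizations" item in this milestone is to bring letters that can be encoded in two ways (such as Za, Rra and the "da" of "pahaad", which exist both as single code points and as base letter plus nukta) to a single form before dictionary lookup. The sketch below does this with Python's standard unicodedata module; the function and table names are hypothetical, and the normalization step in silpa itself may work differently.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def build_nukta_table():
    """Collect Devanagari letters whose canonical decomposition is
    'base letter + nukta', using the Unicode database shipped with Python."""
    table = {}
    for cp in range(0x0900, 0x0980):  # Devanagari block
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == "093C":
            table[cp] = chr(int(parts[0], 16)) + NUKTA
    return table


NUKTA_TABLE = build_nukta_table()


def normalize_nukta(text):
    """Rewrite precomposed nukta letters as base letter + nukta,
    leaving every other character untouched."""
    return text.translate(NUKTA_TABLE)
```

With this step in place, the consonant dictionary only needs entries for the base letters plus a rule for the nukta sign, instead of separate entries for every precomposed form.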
July 3 – July 31:
- Prepare for midterm evaluation
Aug 2: Midterm evaluation
Aug 2 – Sep 2:
- Add CLDR tools to the transliteration system.
- Add Kaithi (optional).
- Write a detailed test suite.
- Final review.
- Write documentation.
- I have kept a buffer of two weeks for any unpredictable delay.