User:Preethy

Google Summer of Code 2013 Proposal for Swathanthra Malayalam Computing

Personal Information
Email Address: preethyvarma@gmail.com
Blog URL: http://myspaceoflearning.blogspot.in/
Freenode IRC Nick: preethy
University and current education: MTech, International Institute of Information Technology, Bangalore

Why do you want to work with the Swathanthra Malayalam Computing?
I am inspired and motivated by the motto of Swathanthra Malayalam Computing, "My language for my computer". Since Malayalam is my mother tongue, it would be a great source of inspiration and personal satisfaction for me if I could contribute my bit of work towards realizing the wonderful idea proposed by this organization.

Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor?
No. I haven't worked on any open source projects before.

Did you participate with the past GSoC programs, if so which years, which organizations?
No, I have not participated in GSoC programs before.

Do you have other obligations between May and August? Please note that we expect the Summer of Code to be a full time, 40 hour a week commitment.
I have a brief summer semester in June and July, but I will not be heavily loaded with course work, so I can work 40 hours a week. My classes begin in August, so I might have to reduce my working hours to 30-35 hours a week from then on.

Will you continue contributing/ supporting the Swathanthra Malayalam Computing after the GSoC 2013 program, if yes, which area(s), you are interested in?
I am very much interested in working in the areas of natural language processing and computational linguistics, especially for Indic languages. So I will continue working on the project with SMC after the GSoC programme.

Why should we choose you over other applicants?
I have been working in the area of natural language processing since September 2012. The projects I have been working on are aimed at the Sanskrit language, whose grammar system corresponds very closely with that of Malayalam. So I have a good knowledge of the semantics of the language and a basic understanding of OpenOffice software. My academic projects and research electives in the next semester are also in natural language processing and word processing. Thus I can promise that I will work on this project with complete enthusiasm and interest.

An overview of your proposal
Malayalam is a very rich language whose grammar rules allow new words to be formed by combining multiple words. This feature is present in most Indic languages. This project aims to build an advanced spell checker that can perform multilevel suffix stripping and handle the complex inflection and agglutination of Malayalam. The present spell checker lacks these features. Hunspell will be used to implement the project, which requires Hunspell dictionary and affix files to be written for Malayalam. Hunspell supports twofold suffix stripping as an extension. If it cannot support five-level suffix stripping, a Python-based solution will have to be found. In order to write the files for Hunspell, the rules for word combination have to be studied in detail.

The need you believe it fulfills
The spell checker to be developed should considerably improve the accuracy and efficiency of Malayalam spell checking. Compound word analysis and synthesis could be made easier and extended to compounds of multiple words, which are very common in literature. It can also pave the way for more complex language analysis applications, including semantic analysis.

Any relevant experience you have
I was part of the team that developed a basic compound word analyser for Sanskrit using a rule-based system and tagged data. This was done in Java. We also developed a "Vibhakthi generator" application, a semantic dictionary and a querying system for Sanskrit. I am working on making the system better and more extensible. These tasks have been done in Java using the OWL API and Jena.

How you intend to implement your proposal
As stated above, the project will be built on Hunspell. Hunspell is a spell checker developed for languages whose rules allow multiple and complex compounding. It supports UTF-8 encoding. Since Malayalam also has complex compounding rules, Hunspell is well suited to it.

Hunspell requires two files: 1. A dictionary file in a specified format. 2. An affix file with optional attributes.
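The shape of these two files can be sketched with a small Python helper that builds them as text. This is only an illustration: the function names are hypothetical, and the stems and suffixes are invented placeholders rather than real Malayalam data. The line layout (a word count on the first line of the dictionary file, and `SFX` header and rule lines in the affix file) follows the usual Hunspell conventions.

```python
# Toy Hunspell-style files built as strings.
# The stems and suffixes are invented placeholders, not real Malayalam data.

def build_dic(entries):
    """Return .dic text: a word-count line, then one 'word/FLAGS' per line."""
    lines = [str(len(entries))]
    for word, flags in entries:
        lines.append(f"{word}/{flags}" if flags else word)
    return "\n".join(lines) + "\n"


def build_aff(suffix_classes, encoding="UTF-8"):
    """Return .aff text with one SFX class per flag.

    suffix_classes maps a flag to a list of (strip, add, condition) rules.
    """
    lines = [f"SET {encoding}", ""]
    for flag, rules in suffix_classes.items():
        # Header line: SFX <flag> <cross product Y/N> <number of rules>
        lines.append(f"SFX {flag} Y {len(rules)}")
        for strip, add, condition in rules:
            # Rule line: SFX <flag> <strip or 0> <add> <condition>
            lines.append(f"SFX {flag} {strip or '0'} {add} {condition}")
        lines.append("")
    return "\n".join(lines)


DIC_TEXT = build_dic([("kata", "A"), ("mira", "A"), ("polu", "")])
AFF_TEXT = build_aff({"A": [("", "ni", "."), ("a", "eni", "a")]})
```

Here flag `A` attaches two suffix rules to the stems that carry it; a real affix file for Malayalam would need many more classes, derived from the grammar rules.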

Briefly, the steps would be as follows.

1. Gain more knowledge of Hunspell.

2. Get clean data files customized for Malayalam.

3. Try multilevel suffix stripping using Hunspell. If it does not work, an algorithm will have to be developed.

4. Decide how to generate the suffix and prefix patterns for each word in Malayalam. This requires a good knowledge of the grammar rules of the language.

5. Study the Sandhi rules systematically.

6. Develop a Vibhakthi generator for nouns and a Kriyapada generator for verbs, if required.
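If Hunspell alone turns out not to be enough for step 3, the fallback could start from something like the following minimal Python sketch of multilevel suffix stripping. It is not a worked-out design: the function names are hypothetical, the stems and suffixes are invented placeholders rather than real Malayalam, and it assumes plain concatenation, so the sound changes described by the Sandhi rules are not handled.

```python
# Toy data: invented placeholders, not real Malayalam stems or suffixes.
STEMS = {"kata", "mira"}
SUFFIXES = ("ni", "lu", "ka")
MAX_LEVELS = 5  # the proposal targets up to five levels of stripping


def strip_suffixes(word, stems=STEMS, suffixes=SUFFIXES, max_levels=MAX_LEVELS):
    """Return (stem, suffixes in surface order), or None if no analysis exists."""

    def search(rest, stripped):
        if rest in stems:
            return rest, stripped
        if len(stripped) == max_levels:
            return None  # depth limit reached without finding a stem
        for suffix in suffixes:
            # Strip one suffix from the end, never consuming the whole word.
            if rest.endswith(suffix) and len(rest) > len(suffix):
                found = search(rest[: -len(suffix)], [suffix] + stripped)
                if found is not None:
                    return found
        return None

    return search(word, [])


def is_known(word):
    """Accept a word if it reduces to a known stem within the depth limit."""
    return strip_suffixes(word) is not None
```

For example, `strip_suffixes("katanilu")` peels off `lu` and then `ni` before reaching the stem `kata`. A real implementation would also have to restrict which suffixes may follow which, much as Hunspell does with affix flags.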

Tell us about something you have created
1. Basic rule based compound word analysis and synthesis in Sanskrit.

2. Semantic dictionary structure for Sanskrit: This is in progress

Have you communicated with a potential mentor? If so who?
Yes, I have emailed my mentor, Mr. Santhosh Thottingal.

SMC Wiki link of your proposal
http://wiki.smc.org.in/User:Preethy/