From SMC Wiki

Google Summer of Code 2013 Proposal for Swathanthra Malayalam Computing

Personal Information

Email Address                    :
Blog URL                         :
Freenode IRC Nick                : preethy
University and current education : MTech, International Institute of Information Technology, Bangalore

Why do you want to work with Swathanthra Malayalam Computing?

I am inspired and motivated by the motto of Swathanthra Malayalam Computing, "My language for my computer". Since Malayalam is my mother tongue, it would give me great personal satisfaction to contribute to realizing the idea behind this organization.

Do you have any past involvement with Swathanthra Malayalam Computing or another open source project as a contributor?

No. I haven't worked on any open source projects before.

Have you participated in past GSoC programs? If so, in which years and with which organizations?

No, I have not participated in GSoC programs before.

Do you have other obligations between May and August? Please note that we expect the Summer of Code to be a full-time, 40-hour-a-week commitment.

I have a brief summer semester in June and July, but the course load will be light, so I can work 40 hours a week. My classes begin in August, so I might have to reduce my working hours to 30-35 hours a week from then.

Will you continue contributing to or supporting Swathanthra Malayalam Computing after the GSoC 2013 program? If yes, which area(s) are you interested in?

I am very interested in natural language processing and computational linguistics, especially for Indic languages, so I will continue working on this project with SMC after the GSoC programme.

Why should we choose you over other applicants?

I have been working in natural language processing since September 2012. My projects so far have targeted Sanskrit, whose grammar system corresponds very closely to that of Malayalam. So I have a good knowledge of the semantics of the language and a basic understanding of the OpenOffice software. My academic projects and research electives next semester are also in natural language processing and word processing. I can therefore promise that I will work on this project with complete enthusiasm and interest.

Proposal Description

An overview of your proposal

Malayalam is a very rich language whose grammar allows new words to be formed by combining multiple words, a feature shared by most Indic languages. This project aims to build an advanced spell checker that performs multilevel suffix stripping and handles the complex inflection and agglutination of Malayalam. The present spell checker lacks these features. The project will be implemented with Hunspell, which requires dictionary and affix files to be written for Malayalam. Hunspell supports twofold suffix stripping as an extension. If it cannot support five-level suffix stripping, a Python-based solution will have to be developed. Writing the Hunspell files requires a detailed study of the rules for word combination.

The need you believe it fulfills

The spell checker to be developed should considerably improve the accuracy and efficiency of the application. Compound word analysis and synthesis could be made easier and extended to compounds of multiple words, which are very common in the literature. It can also pave the way for more complex language analysis applications, including semantic analysis.

Any relevant experience you have

I was part of the team that developed a basic compound word analyser for Sanskrit using a rule-based system and tagged data.
This was done in Java. We also developed a "Vibhakthi generator" application for Sanskrit.
In addition, we developed a semantic dictionary and querying system for Sanskrit.
I am working on making the system better and more extensible. These tasks were done in Java using the OWL API and Jena.

How you intend to implement your proposal

As stated above, the project will be implemented using Hunspell. Hunspell is a spell checker developed for languages whose rules allow multiple and complex compounding. It supports UTF-8 encoding. Since Malayalam also has complex compounding rules, Hunspell is expected to work for it.

Hunspell requires two files:

1. A dictionary file in a specified format.
2. An affix file with optional attributes.
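
As a rough illustration of what these two files might look like, the Python sketch below builds the text of a tiny dictionary file and a tiny affix file. The romanised stem `kutti` and the suffixes `kal` and `ude` are toy placeholders, not project data. The flag names `A` and `B` and the helper names `build_dic` and `build_aff` are hypothetical. The layout follows the general Hunspell format as I understand it: a word count on the first line of the dictionary, and `SFX` rule blocks in the affix file, where a continuation class is written after a slash.

```python
def build_dic(entries):
    """Return dictionary-file text: a word count, then one 'word/FLAGS' per line."""
    lines = [str(len(entries))]
    for word, flags in entries:
        lines.append(f"{word}/{flags}" if flags else word)
    return "\n".join(lines) + "\n"


def build_aff(suffix_classes, encoding="UTF-8"):
    """Return affix-file text with one SFX block per flag.

    suffix_classes maps a flag to a list of (strip, add, condition) rules.
    A continuation class can be written into `add`, e.g. "kal/B".
    """
    lines = [f"SET {encoding}", ""]
    for flag, rules in suffix_classes.items():
        # Header line: SFX <flag> <cross-product Y/N> <number of rules>
        lines.append(f"SFX {flag} Y {len(rules)}")
        for strip, add, condition in rules:
            lines.append(f"SFX {flag} {strip} {add} {condition}")
        lines.append("")
    return "\n".join(lines)


# Toy, romanised data (assumption): kutti -> kuttikal -> kuttikalude
dic_text = build_dic([("kutti", "A")])
aff_text = build_aff({
    "A": [("0", "kal/B", ".")],  # first-level suffix, may be followed by class B
    "B": [("0", "ude", ".")],    # second-level suffix
})
print(dic_text)
print(aff_text)
```

Real Malayalam entries would be written in Malayalam script and would need conditions and stripping rules that reflect sandhi changes, which this toy example leaves out.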

So briefly the steps would be as follows.

1. Gain more knowledge of Hunspell.
2. Obtain clean data files customized for Malayalam.
3. Try multilevel suffix stripping using Hunspell. If it does not work, an algorithm will have to be developed.
4. Decide how to generate the suffix and prefix patterns for each word in Malayalam. This requires a good knowledge of the grammar rules of the language.
5. Study the sandhi rules systematically.
6. A Vibhakthi generator for nouns and a Kriyapada generator for verbs may also be required.
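
Step 3 mentions a fallback algorithm in case Hunspell cannot strip enough suffix levels. A minimal sketch of one possible approach is given below: a recursive search that peels suffixes off the end of a word until a known stem remains, up to a fixed depth. The stem set, the suffix list, the romanised spellings, and the function name `analyse` are illustrative assumptions. A real implementation would also need sandhi handling, which this sketch ignores.

```python
def analyse(word, stems, suffixes, max_depth=5):
    """Return every (stem, [suffixes in surface order]) analysis of `word`.

    An empty list means the word could not be reduced to a known stem
    within `max_depth` levels of suffix stripping.
    """
    results = []

    def walk(rest, stripped, depth):
        if rest in stems:
            # Suffixes were collected from the outside in, so reverse them.
            results.append((rest, list(reversed(stripped))))
        if depth >= max_depth:
            return
        for suffix in suffixes:
            # Require a non-empty remainder after stripping.
            if rest.endswith(suffix) and len(rest) > len(suffix):
                walk(rest[: -len(suffix)], stripped + [suffix], depth + 1)

    walk(word, [], 0)
    return results


# Toy, romanised data (assumption), not real project data.
stems = {"kutti"}
suffixes = ["kal", "ude", "il"]

print(analyse("kuttikalude", stems, suffixes))
print(analyse("kuttix", stems, suffixes))
```

With this toy data, the first call finds the stem `kutti` followed by the suffixes `kal` and `ude`, and the second call returns an empty list. A spell checker built this way could treat a word as correct whenever at least one analysis is found.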

A rough timeline for your progress with phases

Duration | Description | Milestone
Before May 27 | Before announcement of candidates | Literature study and self-practice. Understanding the version control system, code, and documentation of the spell checker module of SILPA, and the Hunspell algorithm. Studying the suffix and prefix patterns of Malayalam.
May 28 – June 16 | Before official coding period starts | Learning more about Hunspell and Python.
June 17 – July 3 | Official coding period starts | Coding, testing, and fixing various features in the spell checker as advised by the mentor. Weekly presentation of components to the mentor.
July 3 – July 31 | Preparing for mid-term evaluation | Implement the Hunspell algorithm for a small data set. Submission of files to the mentor for evaluation.
Aug 1 – Aug 15 | After mid-term evaluation | Refine the scripting as per the mentor's suggestions. Solutions for various non-standard systems in Malayalam. Scripting for multilevel suffix stripping. Making changes to improve functionality.
August 16 – August 29 | Before final stage | Implement the multilevel suffix stripping property. Completion of affix and dictionary files and implementation. Most of the time will be used for rigorous testing.
August 30 – September 10 | Final stage | Documentation of the project.

Tell us about something you have created

1. Basic rule-based compound word analysis and synthesis for Sanskrit.
2. Semantic dictionary structure for Sanskrit (in progress).

Have you communicated with a potential mentor? If so who?

Yes, I have emailed my mentor, Mr. Santhosh Thottingal.

SMC Wiki link of your proposal