User:Priyapappachan/GSoC-spellchecker

Google Summer of Code 2013 Proposal for Swathanthra Malayalam Computing

Personal Information
Email Address: priyapappachan@gmail.com

Blog URL: http://priyapappachan.wordpress.com/

Freenode IRC Nick: pratyas

University and current education: BTech Computer Science, Calicut University

Why do you want to work with the Swathanthra Malayalam Computing?
Swathanthra Malayalam Computing contributes a great deal to Indic languages and language computing. Malayalam is my mother tongue, and I would love to see the language better represented on the web and to help provide language tools. It would be great to be part of this project.

Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor?
No. I haven't had a chance to contribute to open source projects before.

Did you participate with the past GSoC programs, if so which years, which organizations?
No, I've not participated in GSoC programs before.

Do you have other obligations between May and August ? Please note that we expect the Summer of Code to be a full time, 40 hour a week commitment
I will be able to work 40 hours a week starting from May.

Will you continue contributing/ supporting the Swathanthra Malayalam Computing after the GSoC 2013 program, if yes, which area(s), you are interested in?
I will definitely continue contributing, along with maintaining my project, after GSoC. My areas of interest include web development and Flask-based projects.

Why should we choose you over other applicants?
I have been a FOSS enthusiast for years and a member of the FOSS cell in my college for the past two and a half years, where I have taken sessions on FOSS and bash scripting for juniors. Swathanthra Malayalam Computing will be a great platform for me to contribute more to FOSS. This proposal is for the "A spell checker for Indic language that understands inflections" project. I'm planning to do it for Malayalam, since I have a basic knowledge of its grammar system. I also know how the current spell checker works and how hunspell handles inflections and agglutination, so it will be easy for me to work on this.

An overview of your proposal
The spell checker module for SILPA should be capable of handling inflections and agglutination. The current spell checker lacks these features for Malayalam. The proposed project aims at building a more advanced spell checker that performs the multilevel suffix stripping required for Malayalam and handles its inflections and agglutination. The plan is to use the hunspell algorithm: hunspell files have to be written for Malayalam, after scripting the data needed for those files. Hunspell supports two-fold suffix stripping as an extension of single suffix stripping; if it does not support five-level suffix stripping, a Python-based solution has to be found. In order to write the files for hunspell, the Malayalam grammar system and its rules for word combination have to be studied in detail.

The need you believe it fulfills
Beyond the basic spell checker that SILPA provides, the hunspell algorithm will help people use the spell checker more efficiently. Implementing the hunspell algorithm has the following benefits:

 * Performs spell checking of complex words

 * Spell checks inflected and agglutinated words

 * Handles multi-level suffix stripping

 * Gives suggestions for complex words

 * Handles conditional affixes

 * Supports complex compounding

Any relevant experience you have
I have experience in Python, the language in which the current spell checker is written, and I know the algorithm it uses. I also have a basic understanding of how hunspell files are written for a language and of its algorithm, as well as knowledge of the Malayalam grammar system.

How you intend to implement your proposal
Hunspell is a spell checker designed for languages with rich morphology and complex word compounding or character encoding. It offers a terminal-like interface as well as a library. The hunspell algorithm can be implemented for the spell checker.

In the hunspell algorithm, two files are used for spell checking:

1. A dictionary (.dic) file containing the list of words in Malayalam. The first line of the file contains the word count. Each word can optionally be followed by a slash ('/') and flags which represent its affix classes. A second word after the slash makes the entry take the affixation of that word.
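As an illustration, a minimal dictionary file might look as follows. The words and flag letters here are hypothetical English stand-ins, since the Malayalam data is yet to be prepared:

```
3
hello
work/AB
try/B
```

The first line says three entries follow; "work" carries affix classes A and B (defined in the affix file), "try" carries only B, and "hello" takes no affixes.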

2. An affix (.aff) file which contains optional attributes. Some of these attributes are:

 * REP – sets a replacement table of multi-character corrections for suggestions. It is not applied to correct words. The first REP line is a header giving the number of REP entries, which follow on the next lines. REP can be used when the right word form differs by one or more characters.

 * PFX – defines a prefix class.

 * SFX – defines a suffix class.

 * TRY – sets the characters tried one by one for suggestions. It is not applied to correct words; it helps suggest the right word forms.
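A small affix file fragment, again with hypothetical English rules purely for illustration (the Malayalam rules are the subject of this project), shows how these attributes fit together:

```
SET UTF-8
TRY esianrtolcdugmphbyfvkw

REP 2
REP f ph
REP ph f

PFX A Y 1
PFX A 0 re .

SFX B Y 2
SFX B 0 ed [^y]
SFX B y ied y
```

With `work/AB` in the dictionary, prefix class A adds "re" and suffix class B adds "ed" (or turns a final "y" into "ied"), so forms like "rework", "worked" and "reworked" are accepted without being listed explicitly.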

There are also options for compounding words in hunspell. The compound header gives the number of compound definitions. A word can be restricted to being the first, middle, last, or only-middle element of a compound word. For this, flags are defined in the affix file and used in the dictionary file.

Hunspell also supports two-fold suffix stripping for agglutinative languages, which extends single suffix stripping, and it can handle many affix classes. The Hunspell library provides routines giving the user word-level linguistic functions: spell checking (spell) and correction (suggest). Its constructor takes the paths of the affix and dictionary files.

We have to create these two files with the necessary data for the Malayalam spell checker. Suffix stripping can be extended to achieve multilevel suffix stripping; if that doesn't work in hunspell, an algorithm to be implemented in Python has to be found. Help can be sought from the language communities if needed for compiling the data. The dictionary file has already been written for SILPA, so the main task will be creating a hunspell affix file for Malayalam. Since Malayalam is an agglutinative language, the spell checking program can be more complicated.
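If multilevel stripping cannot be expressed in hunspell, a fallback could be a small recursive suffix stripper in Python. The following is only a sketch; the root and suffix lists are hypothetical ASCII stand-ins for the real Malayalam data:

```python
# Minimal sketch of recursive (multilevel) suffix stripping.
# ROOTS and SUFFIXES are hypothetical placeholders, not real Malayalam data.
ROOTS = {"mara", "pada"}
SUFFIXES = ["kal", "il", "ude"]


def is_valid(word, depth=0, max_depth=5):
    """Accept `word` if stripping up to `max_depth` suffixes reaches a root."""
    if word in ROOTS:
        return True
    if depth == max_depth:
        return False
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            # Strip one suffix level and try again on the remainder.
            if is_valid(word[: -len(suf)], depth + 1, max_depth):
                return True
    return False
```

For example, "marakalil" is accepted by stripping "il" and then "kal" to reach the root "mara". The real version would also have to apply sandhi rules at each stripping step.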

A classification of words in Malayalam is necessary for writing the affix file for the spell checker. This can be done as follows:
 * Based on Parts of Speech

All words in the vocabulary are listed and grouped under their respective parts of speech (POS): verbs, nouns, suffixes, adverbs, etc.


 * Based on behavior under tense change

The listed verbs are reduced to their root forms and then classified based on the changes that occur when the tense changes. There are 13 main types of verbs; the exceptions can be added separately.


 * Based on behavior of nouns

All nouns and pronouns are listed and classified based on how they change with gender, number markers, etc.


 * Suffixes

All suffixes are listed and classified based on the context where they are used: noun suffixes, gender markers, number markers, etc. We have to find out which suffix is added where, and the changes due to the addition and deletion of a suffix are to be identified. In Malayalam, combinations of suffixes can occur, and the order in which they can occur has to be known to check the correctness of a word.


 * Auxiliary verbs

All auxiliary verbs (verbs that determine the mood, tense or aspect of another verb) are listed. The classification can be done based on how an auxiliary verb combines with another word.


 * Formulation of sandhi rules

Changes can occur to words when two or more words combine to form another word. These are called sandhi (joint) changes. The change happens at the joining position: the joining can cause insertion or deletion of letters in the middle or at the end of the words, depending on how the joining takes place. The rules for inflection and agglutination in Malayalam have to be studied.

We can represent the Malayalam language using a regular expression whose symbols can be expanded in many ways. Based on the occurrences of symbols in a word, different sandhi rules apply.

The different classifications of sandhi are:

- Positional classification: there are 3 types, based on how the word is formed from the regular expression.

- Consonant-vowel pair based classification: there are 4 types, based on whether the character at the joining position is a consonant or a vowel.
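One way to encode such rules is as a table keyed on the character pair at the joint. The sketch below is only illustrative; the rule entries and example words are hypothetical ASCII stand-ins, not real Malayalam sandhi rules:

```python
# Sketch of applying sandhi rules when two words join.
# Each rule maps the (final char of first word, initial char of second word)
# pair to the string that replaces that pair at the joint.
# These entries are hypothetical placeholders, not real Malayalam rules.
SANDHI_RULES = {
    ("a", "i"): "ayi",  # e.g. a glide inserted between two vowels
    ("m", "k"): "ngk",  # e.g. assimilation of a nasal before a stop
}


def join(first, second):
    """Join two words, applying the matching sandhi rule if any."""
    key = (first[-1], second[0])
    if key in SANDHI_RULES:
        return first[:-1] + SANDHI_RULES[key] + second[1:]
    return first + second
```

So join("mara", "ila") yields "marayila" under the first rule, while pairs with no matching rule are simply concatenated. The real rule set would be derived from the positional and consonant-vowel classifications above.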

There are also rules for the insertion, deletion, duplication and substitution of letters: the letters in the word so formed may not be the same as before, and different characters may undergo these transformations under different rules to form the new word. So each rule has to be written and included in the affix file. Changes can also be made to a noun in order to relate it to the other words in the sentence (വിഭക്തി); there are 7 ways in which this change can be made to a noun, of which 2 are currently written. The remaining types have to be written and used in the spell checker module. Hunspell splits the word into a valid suffix, root word, etc., and then performs the following checks on the word.

 * Suffix check

It checks whether the word ends with a valid suffix. If it does, the suffix is stripped off and later reconstructed during sandhi checking.

 * Postposition check

It checks whether the word ends with a postposition. If so, it is stripped off and a sandhi check is performed.

 * Sandhi check

The stripped part and the base word are joined using sandhi rules. The root word check is performed after the sandhi check.

 * Root word check

It checks whether the word is a root word or ends with a root word. In the latter case, the valid part is stripped off for a further sandhi check.
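The checks described above could be sketched as a small pipeline in Python. All word lists and the sandhi-undoing step here are hypothetical placeholders, not real Malayalam data:

```python
# Sketch of the check pipeline: root word check, postposition check,
# suffix check, with a sandhi-undo step before each re-lookup.
# ROOTS, SUFFIXES, POSTPOSITIONS and the desandhi rule are hypothetical.
ROOTS = {"mara"}
SUFFIXES = ["kal"]
POSTPOSITIONS = ["il"]


def desandhi(stem):
    """Undo a (hypothetical) sandhi change before looking up the stem."""
    return stem[:-1] if stem.endswith("y") else stem


def check(word):
    # Root word check: accept immediately if the word itself is a root.
    if word in ROOTS:
        return True
    # Postposition check: strip a trailing postposition, undo sandhi, recheck.
    for pp in POSTPOSITIONS:
        if word.endswith(pp) and check(desandhi(word[: -len(pp)])):
            return True
    # Suffix check: strip a trailing suffix, undo sandhi, recheck.
    for suf in SUFFIXES:
        if word.endswith(suf) and check(desandhi(word[: -len(suf)])):
            return True
    return False
```

For instance, "marakalil" passes by stripping the postposition "il", then the suffix "kal", reaching the root "mara"; "marayil" additionally exercises the sandhi-undo step.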

If there is no error after the above checks, the word is passed to the next phase, where sandhi and suffix rules are applied to ensure its validity. Otherwise it is passed to the suggestion module.

Challenges:
 * When words are searched for suggestions, the ones whose meanings are close to the given word have to be found. A change in one letter can change the meaning of a word, so the words closest in meaning should be offered first, and a rule for this is also necessary. One way is to find the root of the given word and generate its inflections; this list is then used for suggestions when the word is not found. Otherwise the suggested words may not be related to the given word.


 * The issues between chillaksharam and samvruthokaram.


 * The exceptions that can occur in the sandhi rules. They have to be listed separately in the affix file.


 * Another issue is that the same word can be written in various ways by different users, with slight changes in the letters, because there is no unique writing system in Malayalam. For example, a word can be split using the pseudo samvruthokaram (chandrakkala). To handle this, every word would have to be written in all the possible forms that do not change its meaning, and these would have to be included in the affix file. This is difficult because reforms still occur in the language. A better way would be to follow one lipi for the language, but choosing a lipi for our computations is also difficult.
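The root-and-inflections idea from the first challenge could be prototyped by generating the inflected forms of known roots and ranking them by similarity to the misspelled word. This sketch uses Python's standard difflib; the roots and suffixes are hypothetical placeholders:

```python
# Sketch of root-based suggestion: generate inflections of known roots
# and rank them by similarity to the misspelled word.
# ROOTS and SUFFIXES are hypothetical placeholders, not real Malayalam data.
from difflib import SequenceMatcher

ROOTS = ["mara", "pada"]
SUFFIXES = ["", "kal", "il", "kalil"]


def suggest(word, limit=3):
    """Return up to `limit` inflected forms most similar to `word`."""
    candidates = [root + suf for root in ROOTS for suf in SUFFIXES]
    candidates.sort(
        key=lambda c: SequenceMatcher(None, word, c).ratio(),
        reverse=True,
    )
    return candidates[:limit]
```

Because every candidate is an inflection of a real root, the suggestions stay related to the given word instead of being arbitrary near-misses.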

A rough timeline for your progress with phases
A buffer of one week has been kept for unpredictable delays.

Tell us something about you have created
1. Event Notifier for college: a web app which sends notifications about new events in the college.

Git hub repository of the project https://github.com/priyapappachan/eventnotifier/tree/my-remote

Blog post http://priyapappachan.wordpress.com/

Have you communicated with a potential mentor? If so who?
Yes, I've communicated with Santhosh Thottingal.

SMC Wiki link of your proposal
http://wiki.smc.org.in/User:Priyapappachan/GSoC-spellchecker