Google Summer of Code 2013 Proposal for Swathanthra Malayalam Computing
- 1 Personal Information
- 1.1 Why do you want to work with the Swathanthra Malayalam Computing?
- 1.2 Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor?
- 1.3 Did you participate with the past GSoC programs, if so which years, which organizations?
- 1.4 Do you have other obligations between May and August ? Please note that we expect the Summer of Code to be a full time, 40 hour a week commitment
- 1.5 Will you continue contributing/ supporting the Swathanthra Malayalam Computing after the GSoC 2013 program, if yes, which area(s), you are interested in?
- 1.6 Why should we choose you over other applicants?
- 2 Proposal Description
- 2.1 An overview of your proposal
- 2.2 The need you believe it fulfills
- 2.3 Any relevant experience you have
- 2.4 How you intend to implement your proposal
- 2.5 A rough timeline for your progress with phases
- 2.6 Tell us something about you have created
- 2.7 Have you communicated with a potential mentor? If so who?
- 2.8 SMC Wiki link of your proposal
Email Address: firstname.lastname@example.org
Blog URL: http://velmurugancse.blogspot.in/
Freenode IRC Nick: velraam
University and current education: B.E. Computer Science, Anna University, Chennai. Sri Sivasubramaniya Nadar College of Engineering, Chennai.
Why do you want to work with the Swathanthra Malayalam Computing?
SMC is involved in Indic language computing, although it focuses on Malayalam.
I want to contribute to my mother tongue, Tamil, through my programming knowledge.
Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor?
No. I haven't contributed to open source projects before.
Did you participate with the past GSoC programs, if so which years, which organizations?
No, I've not participated in GSoC programs before.
Do you have other obligations between May and August ? Please note that we expect the Summer of Code to be a full time, 40 hour a week commitment
I will be able to work 40 hours a week starting from May.
Will you continue contributing/ supporting the Swathanthra Malayalam Computing after the GSoC 2013 program, if yes, which area(s), you are interested in?
I will definitely continue contributing and maintaining my project after GSoC. My areas of interest include Tamil computing projects, specifically those related to Tamil grammar.
Why should we choose you over other applicants?
I'm an open source enthusiast and have taken part in many open source events and activities conducted by small groups
such as ILUG-C (Indian Linux Users Group, Chennai).
I took part in workshops conducted by Wikipedia and Mozilla on our college campus.
I have frequently attended the monthly meetings of ILUG-C.
I want a chance to prove myself through this project.
Moreover, I have sufficient knowledge of Tamil grammar and a passion to serve my language.
The current spell checker lacks some features for Tamil. The proposed project aims to make it work on Linux-based systems, so that it serves the open source community, and with applications such as LibreOffice. It currently has some issues, and those issues need to be fixed.
An overview of your proposal
The proposed project aims to build a more advanced spell checker that supports the multilevel suffix stripping required for Tamil.
It will also have features to handle inflection and agglutination in Tamil.
The plan is to use the Hunspell algorithm.
For this, Hunspell files have to be written for Tamil, after scripting the data needed for the files.
Hunspell supports twofold suffix stripping, as an extension of single suffix stripping.
If it does not support five-level suffix stripping, a Python-based solution will have to be found.
In order to write the Hunspell files, the Tamil grammar system and its rules for word combination have to be studied in detail.
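As a rough illustration of what a Python-based fallback for multilevel suffix stripping might look like, the sketch below strips one suffix at a time until a known root remains. The root list, the suffix list, the informal romanization and the function name `strip_suffixes` are all assumptions made for this example; real Tamil needs sandhi rules that plain string concatenation does not capture.

```python
# Toy data (assumptions for illustration only), in informal romanization.
ROOTS = {"veedu"}                    # "house"
SUFFIXES = ["kal", "il", "irunthu"]  # plural, locative, "from"


def strip_suffixes(word, roots=ROOTS, suffixes=SUFFIXES, max_depth=5):
    """Return (root, suffixes in surface order), or None if no analysis exists.

    max_depth caps the number of stripped suffixes (five levels here).
    """
    if word in roots:
        return word, []
    if max_depth == 0:
        return None
    for suffix in suffixes:
        # Strip the suffix only if something is left over to analyse.
        if word.endswith(suffix) and len(word) > len(suffix):
            result = strip_suffixes(word[:-len(suffix)], roots, suffixes,
                                    max_depth - 1)
            if result is not None:
                root, inner = result
                return root, inner + [suffix]
    return None


print(strip_suffixes("veedukalilirunthu"))
print(strip_suffixes("veedukalx"))
```

A word is treated as correctly spelled when an analysis is found. Capping the depth keeps the search bounded even with a large suffix list.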
The need you believe it fulfills
Beyond the basic spell checker that SILPA provides, the Hunspell algorithm will help people use the spell checker more efficiently. Implementing the Hunspell algorithm has the following benefits:
- Performs spell checking of complex words
- Spell checks inflecting and agglutinating words
- Spell checks words that need multilevel suffix stripping
- Gives suggestions for complex words
- Handles conditional affixes
- Supports complex compounding
Tamil, like other Dravidian languages, is an agglutinative language. Tamil words consist of a lexical root to which one or more affixes are attached. Most Tamil affixes are suffixes. Tamil suffixes can be derivational suffixes, which either change the part of speech of the word or its meaning, or inflectional suffixes, which mark categories such as person, number, mood, tense, etc. There is no absolute limit on the length and extent of agglutination, which can lead to long words with a large number of suffixes, which would require several words or a sentence in English. To give an example, the word pōkamuṭiyātavarkaḷukkāka means "for the sake of those who cannot go", and consists of the following morphemes:
|pōka||muṭi||y||āta||var||kaḷ||ukku||āka|
|go||accomplish||word-joining letter||negation (impersonal)||nominalizer (he/she who does)||plural marker||to||for|
Words formed as a result of the agglutinative process are often difficult to translate. According to Today Translations, a British translation service, the Tamil word "செல்லாதிருப்பவர்" (sellaathiruppavar, meaning a certain type of truancy) is ranked 8th in The Most Untranslatable Word In The World list.
Any relevant experience you have
No, I don't have any relevant experience, but I am confident that I can learn these things as quickly as possible.
How you intend to implement your proposal
Hunspell is a spell checker and morphological analyzer designed for languages with complex word compounding or character encoding. It offers a terminal-based interface. The Hunspell algorithm can be used to implement the spell checker.
In the Hunspell algorithm, two files are used for spell checking.
1. A dictionary file containing a list of Tamil words. The first line of the file contains the word count. Each word can optionally be followed by a slash ('/') and flags that represent affixes. The part after the slash sets the affixation for that word.
2. An affix file, which contains optional attributes. Some of these attributes are:
REP - Sets a replacement table for multi-character corrections in suggestions. It is not applied to correctly spelled words. The first REP line is a header giving the number of REP entries, which follow on the next lines. REP is useful when the right word form differs by one or more characters.
PFX - Defines a prefix.
SFX - Defines a suffix.
TRY - Sets the characters to try when generating suggestions. It is not applied to correctly spelled words. It helps suggest the right word forms.
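To make the two file formats more concrete, here is a minimal sketch that holds a tiny dictionary file and affix file as strings, parses them, and expands the dictionary entries with the SFX rules. The sample words, the flag `A` and the helper names are assumptions for illustration. The parser covers only a small subset of the affix syntax (single-character flags, and conditions approximated with Python regular expressions), so it is not a substitute for Hunspell itself.

```python
import re

# Hypothetical sample files in informal romanization (not real project data).
DIC_TEXT = """2
veedu/A
maram
"""

AFF_TEXT = """SET UTF-8
SFX A Y 1
SFX A 0 kal .
"""


def parse_dic(text):
    """Return (declared word count, {word: flag string})."""
    lines = text.strip().splitlines()
    count = int(lines[0])
    entries = {}
    for line in lines[1:]:
        word, _, flags = line.strip().partition("/")
        entries[word] = flags
    return count, entries


def parse_sfx(text):
    """Return {flag: [(strip, add, condition), ...]} from SFX rule lines."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        # Rule lines have five fields; the four-field header line is skipped.
        if len(parts) >= 5 and parts[0] == "SFX":
            flag, strip, add, cond = parts[1:5]
            rules.setdefault(flag, []).append((strip, add, cond))
    return rules


def expand(entries, rules):
    """Generate every surface form licensed by the dictionary flags."""
    forms = set()
    for word, flags in entries.items():
        forms.add(word)
        for flag in flags:
            for strip, add, cond in rules.get(flag, []):
                if not re.search(cond + "$", word):
                    continue
                stem = word
                if strip != "0":  # "0" means nothing is stripped
                    if not word.endswith(strip):
                        continue
                    stem = word[:-len(strip)]
                forms.add(stem + ("" if add == "0" else add))
    return forms


count, entries = parse_dic(DIC_TEXT)
print(count, entries)
print(sorted(expand(entries, parse_sfx(AFF_TEXT))))
```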
Hunspell also has options for compounding words. The compound header gives the number of compound definitions. A word can be the first, middle or last element of a compound word, only a middle element, and so on. Flags for this are defined in the affix file and used in the dictionary file.
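The idea of position flags can be sketched in Python as follows. This is a simplified stand-in and not Hunspell's own compounding logic: the role names `begin`, `middle` and `end`, the sample words and the function `is_compound` are assumptions for illustration.

```python
# Hypothetical position roles for a few romanized words.
ROLES = {
    "thamizh": {"begin"},
    "mozhi": {"middle", "end"},
    "naadu": {"end"},
}


def is_compound(word, roles=ROLES):
    """True if word splits into begin + middle* + end parts (two or more)."""

    def walk(rest, position):
        for part, allowed in roles.items():
            if not rest.startswith(part):
                continue
            remainder = rest[len(part):]
            if not remainder:
                # The last part must be allowed at the end and must not
                # also be the first part.
                if position > 0 and "end" in allowed:
                    return True
                continue
            needed = "begin" if position == 0 else "middle"
            if needed in allowed and walk(remainder, position + 1):
                return True
        return False

    return walk(word, 0)


print(is_compound("thamizhnaadu"))
print(is_compound("naaduthamizh"))
```

In a real affix file these roles would be expressed as flags attached to the dictionary entries, rather than as a Python mapping.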
Hunspell also supports twofold suffix stripping for agglutinating languages; single suffix stripping is extended for this purpose. Hunspell can also handle many affix classes. The Hunspell library provides routines that give the user word-level linguistic functions: spell checking with spell() and correction with suggest(). Its constructor needs the paths of the affix and dictionary files.
We have to create these two files with the data needed for a Tamil spell checker. Suffix stripping can be extended to achieve multilevel suffix stripping. If that does not work in Hunspell, an algorithm implemented in Python will have to be found. Help can be sought from language communities for scripting if needed. The dictionary file has already been written for SILPA, so the main task will be creating a Hunspell affix file for Tamil. Since Tamil is an agglutinating language, the spell checking program may be more complicated.
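The sketch below mimics the shape of the spell() and suggest() functions with a small Python class, to show how a REP table and TRY characters could drive suggestions. It is a toy stand-in and not the Hunspell library or its Python bindings: the class name `ToySpeller`, its constructor arguments (a word list instead of file paths) and the sample data are all assumptions.

```python
class ToySpeller:
    """Toy spell checker with Hunspell-like spell() and suggest() methods."""

    def __init__(self, words, rep_table=(), try_chars=""):
        self.words = set(words)
        self.rep_table = list(rep_table)  # (wrong, right) pairs, like REP
        self.try_chars = try_chars        # characters to try, like TRY

    def spell(self, word):
        """Return True if the word is in the word list."""
        return word in self.words

    def suggest(self, word):
        """Return known words reachable by one REP or TRY substitution."""
        candidates = []
        # REP-style: replace each occurrence of a known wrong sequence.
        for wrong, right in self.rep_table:
            start = word.find(wrong)
            while start != -1:
                candidates.append(
                    word[:start] + right + word[start + len(wrong):])
                start = word.find(wrong, start + 1)
        # TRY-style: replace each single character with a TRY character.
        for i in range(len(word)):
            for ch in self.try_chars:
                candidates.append(word[:i] + ch + word[i + 1:])
        suggestions = []
        for cand in candidates:
            if (cand in self.words and cand != word
                    and cand not in suggestions):
                suggestions.append(cand)
        return suggestions


speller = ToySpeller({"veedu", "veedukal"},
                     rep_table=[("i", "ee")], try_chars="aeu")
print(speller.spell("veedu"))
print(speller.suggest("vidu"))
print(speller.suggest("veeda"))
```

As in the description of REP and TRY, suggestions are only meaningful for misspelled words, so a caller would normally call suggest() only after spell() returns False.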
A rough timeline for your progress with phases
|Before May 27||Before Announcement of Candidates||Become more familiar with how Hunspell works and with the requirements of the project. Try the spell checker in SILPA and Hunspell in Malayalam for different compound words. Learn the Malayalam grammar system in detail.|
|May 28 - June 16||Before Official Coding Period Starts||Practice coding in Python to further improve my understanding of the concepts involved. Start learning the Hunspell algorithm. During this period I will remain in constant touch with my mentor to be absolutely clear about my future goals.|
|June 17 - July 3||Official Coding Period Starts||Coding, testing and debugging of various features in the spell checker. Start scripting various suffix and prefix patterns in Tamil. Learn more about the Tamil grammar system. Learn the exceptions that occur in the language. Start writing the affix file for Hunspell. Present components to the mentor weekly.|
|July 3 - July 31||Preparing for mid term evaluation||Learn the different classifications of suffixes in detail. Learn how the multilevel suffix stripping rules work. Script words based on the rules learnt. Ask language communities for help with further scripting. Implement the Hunspell algorithm. Submit files to the mentor for evaluation.|
|Aug 1 - Aug 15||After mid term evaluation||Refine the scripting as per the mentor's suggestions. Find solutions for various non-standard systems in Tamil. Scripting for multilevel suffix stripping. Make changes to improve functionality.|
|August 16 - August 29||Before Final stage||Implement the multilevel suffix stripping property. Complete the affix and dictionary files and the implementation. Most of the time will be used for rigorous testing.|
|August 30 - September 10||Final Stage||Documentation of the project.|
Tell us something about you have created
No. I haven't created any successful projects yet.
Have you communicated with a potential mentor? If so who?
No. I have not yet communicated with Santhosh Thottingal (mentor).