User:Lonesword/gsoc 2014 proposal

From SMC Wiki

Personal Information

Name : Kevin Martin Jose

Email : youcancallmekevin@gmail.com

Phone : +917736693833

Blog : http://kevincoder.wordpress.com/

Personal website : http://www.kevinkoder.tk

Freenode IRC nick : lonesword

GitHub handle : lonesword

Current education : 3rd year Computer Science, College of Engineering, Thiruvananthapuram

Previous involvement with SMC : Improved the Freedom Toaster [4] as part of an SMC hackathon conducted at Thiruvananthapuram. It was part of Praveen A's and Sooraj Kenoth's cycling campaign. I was also a volunteer when SMC celebrated 'vyazhavattom' at Thrissur in 2013.

Other projects : Contributing to NuPIC, the Numenta Platform for Intelligent Computing [3]

Why SMC : A GSoC mentoring organization from my place is something to be proud of. Moreover, any work I do as a part of SMC will benefit the residents of my country and is my chance to bond with the community.

Other commitments : My university exams will start at the end of April or the beginning of May and will end by the last week of May. My classes will reopen after the semester break in mid-July. I believe the milestones I have set leave sufficient time between them, and I am confident that I can manage my time effectively during my exams and devote 40 hours per week to the project.

I will contribute to SMC even after GSoC 2014. I'm interested in the varnam project and in work that involves user interaction. I love working with Processing.


Why me : I have the skills needed for the project (C and basic Ruby) and have done thorough preliminary research, as demonstrated by this issue [8]. I'm geographically close to SMC and am fluent in Malayalam, which is necessary to draft the stem rules (initial draft here [2]) and to come up with a general algorithm. I'm passionate about open source and am a Linux enthusiast. A small portfolio can be found at my personal website [5].


Summary

Varnam is a transliterator for Indic languages that can convert words typed phonetically in English to the corresponding patterns in an Indic language. The aim of this project is two-fold :

1) Implement an algorithm that can infer the base word when a complex word is encountered, and integrate it into the varnam framework.

2) Implement a queuing system to improve concurrency support for the learn function.


Motivation

In a country like India, being able to interact with a computer in one's native language is of paramount importance. A good majority of the country's residents are not English educated and hence are unable to comprehend the Latin characters found in most interactive applications. Varnam is a project that has stepped in to solve transliteration woes and, being an Indian, I understand the need for a solid transliterator. Currently varnam supports just two languages, Malayalam and Hindi. Support for other Indic languages is of high priority. However, the current learning mechanism needs to be improved before support for other languages is added.


The problem

1) Today when a word is learned, varnam takes all possible prefixes into account and learns all of them to improve future suggestions. For example, assume that the word മലയാളം ('malayalam') is supplied to varnam. Varnam tokenizes it and learns the patterns മല, മലയാ, മലയാള ('mala', 'malaya', 'malayala') etc. so that any complex word that has one of these prefixes as the base word can be easily predicted, as in മലയാളി ('malayali'), which has മലയാള ('malayala') as the base word. However, the patterns corresponding to മല and മലയാ are almost never encountered independently; only മലയാള is encountered frequently. Hence it is wasteful for varnam to learn the first two prefixes, since this makes the database larger and searches longer.


2) The learn function cannot execute concurrently. This is due to a limitation of SQLite. For example, if a training session is in progress and words are being learned from a text file, any call to varnam_learn() from a client will result in an error, since the whole database file is locked while a transaction is in progress. Hence all requests to varnam_learn() made during that time are lost. Some mechanism is required that allows these failed requests to be re-executed when the database is no longer locked.
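Problem 1 can be illustrated with a minimal Python sketch (Python is used here only for brevity; varnam itself is written in C). The token list is hand-written for this example and is an assumption: varnam's real tokenizer may split the word differently, and the function name prefixes is hypothetical.

```python
from itertools import accumulate


def prefixes(tokens):
    """Return every cumulative prefix of a token sequence, shortest first."""
    # accumulate() on strings concatenates them step by step.
    return list(accumulate(tokens))


# Assumed tokenization of 'malayalam'; not taken from varnam's tokenizer.
tokens = ["മ", "ല", "യാ", "ള", "ം"]

for pattern in prefixes(tokens):
    # Every one of these would be stored, although only the longer
    # prefixes are useful as base words.
    print(pattern)
```

One word with five tokens produces five stored patterns, which is where the growth of the database comes from.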
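Problem 2 can be reproduced outside varnam with Python's sqlite3 module. This is only a sketch of the locking behaviour: the table name words, the file name and the function demo_locked_write are assumptions made for the example and do not reflect varnam's actual schema.

```python
import os
import sqlite3
import tempfile


def demo_locked_write():
    """Show that a second writer fails while a write transaction is open."""
    path = os.path.join(tempfile.mkdtemp(), "words.db")

    # isolation_level=None gives autocommit mode, so transactions are
    # opened and closed only by the explicit SQL statements below.
    trainer = sqlite3.connect(path, isolation_level=None)
    trainer.execute("CREATE TABLE words (word TEXT)")
    trainer.execute("BEGIN IMMEDIATE")  # take the write lock, like a training run
    trainer.execute("INSERT INTO words VALUES ('from-training')")

    # timeout=0 makes the client fail at once instead of waiting for the lock.
    client = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        client.execute("INSERT INTO words VALUES ('from-client')")
        error = None
    except sqlite3.OperationalError as exc:
        error = str(exc)  # the client's word is lost at this point

    trainer.execute("COMMIT")
    stored = [row[0] for row in client.execute("SELECT word FROM words")]
    trainer.close()
    client.close()
    return error, stored


error, stored = demo_locked_write()
print("client error:", error)
print("stored words:", stored)
```

Only the word written by the training connection ends up in the table; the client's word is dropped unless the caller retries.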


The solution

1) The solution to this problem is to infer the base word when a complex word is encountered. Thus when മലയാളം is encountered, the base word മലയാള is inferred and only this base is learned by varnam. The base word can be obtained using a stemming algorithm. Hence the objective is to develop a Malayalam stemmer, similar in concept to the English stemmer [1]. The stemming algorithm uses a set of stem rules to convert the complex word into its base form. Each language will have its own set of stem rules. In order to avoid a separate algorithm for each language, the stem rules will be specified in the scheme file. These rules will be compiled with the scheme file into an SQLite database and will be used by an algorithm implemented in learn.c. Doing so will make it easier to add support for other Indic languages, since it separates the algorithm from the stem rules. I have made an initial draft of the Malayalam stem rules here [2].


2) To save the requests to varnam_learn() made while the database file is locked, a temporary database will be implemented. When the primary database is no longer locked, the contents of the temporary database will be read and fed into varnam_learn() to be stored in the primary database.
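The rule-driven stemming described in 1) could look roughly like the Python sketch below. The two rules are hypothetical and for illustration only; they are not taken from the draft rule set [2], and in the proposed design the rules would be read from the compiled scheme file rather than from a list in the source. The names STEM_RULES, stem and min_stem_len are likewise assumptions.

```python
# Hypothetical stem rules as (suffix, replacement) pairs, tried top to bottom.
STEM_RULES = [
    ("\u0d02", ""),  # anusvara: മലയാളം -> മലയാള, the example used in the text
    ("\u0d3f", ""),  # vowel sign i: മലയാളി -> മലയാള, illustrative only
]


def stem(word, rules=STEM_RULES, min_stem_len=2):
    """Apply the first matching rule once and return the result.

    The rules are applied in a single pass with no re-application, so the
    sketch never recurses; min_stem_len keeps very short words intact.
    """
    for suffix, replacement in rules:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len] + replacement
    return word  # no rule matched: the word is its own base form


print(stem("മലയാളം"))
print(stem("മലയാളി"))
```

Because the function only sees a list of pairs, supporting another language would mean supplying a different rule list and leaving the algorithm untouched, which is the separation the proposal aims for.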
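The temporary-database queue described in 2) might be sketched in Python as follows. This is an assumption-laden outline, not the planned C implementation: learn and drain are hypothetical stand-ins for varnam_learn() and the replay step, and the table names words and queue are invented for the example.

```python
import os
import sqlite3
import tempfile


def learn(primary, pending, word):
    """Store word in primary; if primary is locked, park it in pending."""
    try:
        primary.execute("INSERT INTO words VALUES (?)", (word,))
        return True
    except sqlite3.OperationalError as exc:
        if "locked" not in str(exc):
            raise  # a different failure: do not hide it
        pending.execute("INSERT INTO queue (word) VALUES (?)", (word,))
        return False


def drain(primary, pending):
    """Replay queued words into primary in arrival order; return the count."""
    rows = pending.execute("SELECT id, word FROM queue ORDER BY id").fetchall()
    for row_id, word in rows:
        primary.execute("INSERT INTO words VALUES (?)", (word,))
        pending.execute("DELETE FROM queue WHERE id = ?", (row_id,))
    return len(rows)


workdir = tempfile.mkdtemp()
primary_path = os.path.join(workdir, "primary.db")
pending_path = os.path.join(workdir, "pending.db")

# Autocommit connections; timeout=0 makes the client fail fast when locked.
trainer = sqlite3.connect(primary_path, isolation_level=None)
trainer.execute("CREATE TABLE words (word TEXT)")
client = sqlite3.connect(primary_path, timeout=0, isolation_level=None)
pending = sqlite3.connect(pending_path, isolation_level=None)
pending.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, word TEXT)")

trainer.execute("BEGIN IMMEDIATE")         # a training session holds the lock
stored = learn(client, pending, "മലയാള")   # lands in the queue instead
trainer.execute("COMMIT")                  # training ends, lock released
moved = drain(client, pending)             # queued word reaches the primary
print(stored, moved)
```

Keeping the queue in its own database file means a lock on the primary database never blocks the write that records the failed request.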


Milestones

Required Deliverables :

a. May 25 : Finish drafting stem rules for Malayalam

b. June 5 : Decide on an ordering of the stem rules and test it. Correct ordering is required so as to avoid recursion in the algorithm. A sample Malayalam text (Wikipedia source) will be used to test the stem rules.

c. June 10 : Add stem rules to scheme file. Compile and test it by calling stem rules on individual Malayalam words.

d. June 15 : Finish drafting the stemming algorithm.

e. June 25 : Implement the algorithm in learn.c.

f. July 10 : Perfect the algorithm after testing on a text corpus.

g. July 20 : Implement and test a simple queue using a temporary database to store patterns that were to be learned when the primary database was busy.

h. July 30 : Code cleanup and documentation.


Nice to haves :

1. A detailed tutorial on how to add support for another indic language.

2. Stem rules for Hindi - allows Hindi to use the improved learning mechanism.

3. Help with incomplete documentation.


I have communicated with the mentor, Navaneeth K N, about the project and he was very helpful. I developed an interest in machine learning and linguistics only recently, when I came to know about Numenta and their project NuPIC. Before that I was mainly interested in graphical applications, and most of my projects used Processing and Python/pygame. I created a paint application in Processing [6] and a Conway's Game of Life simulator in pygame [7].


References

1. http://snowball.tartarus.org/algorithms/english/stemmer.html

2. https://github.com/lonesword/mlstemmer/blob/master/stemmer_rules

3. https://github.com/numenta/nupic/pull/722#issuecomment-37193358

4. https://github.com/aneesh/freedomtoaster/pull/1

5. http://www.kevinkoder.tk

6. http://www.kevinkoder.tk/paint7.htm

7. http://kevincoder.wordpress.com/2013/01/17/living-the-life/

8. https://savannah.nongnu.org/bugs/?40714