User:Karthikd


Personal information

Name : Karthik Damarsetti

Email : d.vinaykarthik@gmail.com

Blog URL : http://projects4summer2014.blogspot.com/

Freenode IRC nick : karthikd

Current Education : Final year of Master's in Computer Science at California State University Long Beach (CSULB), CA, USA.

Why SMC : The SMC is doing a commendable job of making the Dravidian languages computer- and tech-friendly. This lets people use these languages seamlessly on computers, in software projects and on other devices, and it would enhance the productivity of individuals who are more at ease with the vernacular than with 'universal' languages such as English. Moreover, this effort helps young people learn more of their native language and helps enrich and preserve the languages of India.

Other commitments : I currently have no other commitments during the GSoC 2014 period.

Why me : I have a keen interest in both programming languages and vernacular languages. On the technical side, I am proficient in C/C++ and Java; among natural languages, I am proficient in English, Hindi, Tamil and Telugu and have a basic knowledge of Kannada. Working with the SMC gives me a unique opportunity to combine both kinds of language knowledge while learning new ones: Python/Ruby on one hand and Malayalam on the other.

Past Experience with GSoC or SMC : I don't have any prior experience with either the GSoC or the SMC.

Continued Support to the SMC projects : I can't guarantee that I will be able to provide continued support to the SMC after the Google Summer of Code 2014.

Proposals

I would like to present 2 proposals for consideration. I have added a section for relevant personal information.

Project Relevant information

Relevant Experience : I don't have any NLP experience, but I have textual analysis experience from my university coursework. I have taken two courses on Computational Linguistics Theory.

Previous Projects : I haven't developed an NLP-based project. However, I have developed projects in which semantic analysis was performed on Google Books and the semantic trend of the English language was plotted.

SMC Wiki Proposal Link : My proposals can be found at http://wiki.smc.org.in/User:Karthikd

Improve the Varnam learning system

Proposal Overview : My proposal is to develop an algorithm that helps refine the tokens produced by the Varnam transliteration engine.

Needs Fulfilled : Currently, the engine transliterates text based solely on the input characters. As a result, some text is transliterated incorrectly even though the input is correct. An effective learning algorithm can prevent such errors. A refined Varnam engine can produce results more efficiently, enabling it to be used in other areas as required.

Rough Implementation details

I propose developing a ranking-based algorithm so that incorrect words are seldom suggested to users. The learning algorithm can be improved by analyzing the 'correct' words in the language's vocabulary. When the learning algorithm detects that a word is correct or more frequent, it gives that word a higher rank during transliteration. This brings the output much closer to what the user intended, increasing the efficiency of the engine. A textual analysis of the language also enables tagging of word stems, which boosts the learning process of the Varnam engine as a whole.
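The ranking idea above can be illustrated with a minimal Python sketch. It is only an assumption of how such a ranker might look: the class `FrequencyRanker` and its methods are hypothetical names, it does not use the Varnam API, and whitespace tokenization stands in for a real tokenizer.

```python
from collections import Counter


class FrequencyRanker:
    """Toy learner (hypothetical, not part of Varnam): counts corpus words."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, text):
        # Whitespace tokenization is a simplification; a real system
        # would need language-aware tokenization.
        self.counts.update(text.split())

    def rank(self, candidates):
        # Higher learned frequency sorts first. sorted() is stable, so
        # candidates with equal counts keep the order they arrived in.
        return sorted(candidates, key=lambda word: -self.counts[word])


ranker = FrequencyRanker()
ranker.learn("മലയാളം മലയാളം മലയാലം")
# The more frequent spelling moves ahead of the rarer one.
ranked = ranker.rank(["മലയാലം", "മലയാളം"])
```

Because unseen candidates simply count as zero, they fall to the end of the list instead of being dropped, so the user can still pick them.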

Estimated Project Timeline

April 6 - April 21 : Learn Python, Ruby (as required)

April 21 - May 13 [Community Bonding] : Study the Varnam Engine API and discuss pseudocode for the textual analysis and ranking algorithms

May 15 - May 20 : Get corpora for performing text mining for the analysis/learning.

May 21 - June 15 : Develop a textual analysis algorithm - Learning with Stemming capacity.

June 16 - June 22 : Debug code in preparation for Mid-Term Evaluation

June 23 : Mid-term evaluation submission and project progress update.

June 24 - June 30 : Develop the ranking algorithm

July 1 - July 19 : Integrate the learning algorithm with the ranking algorithm.

July 20 - July 31 : Test/Debug code

Aug 1 - Aug 10 : Finalize the documentation

Aug 11 - Aug 18 : Final touches

Aug 18 - Aug 22 : Submit code to Google


Communication with Mentors

I haven't communicated with any project mentors yet.

Spell checker for Indic language that understands inflections

Proposal Overview : Inflections in Indic languages depend on the gender and the number of the noun in the sentence. The inflections can be tabulated, and this table can be used to develop the spell-checker once word stems can be computed from the input text. An effective learning algorithm also needs to be designed, whose output the spell-checker can use as a dictionary.

Needs Fulfilled : A learning algorithm coupled with knowledge of the possible word stems and the legitimate inflections can go a long way in improving the effectiveness of the spell-checker.

Rough Implementation details : A learning algorithm has to be developed that holds the list of inflections based on the gender/number of the word. A word-stemming algorithm also needs to be designed to tackle the issue of agglutinated words in the language.
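As a rough illustration of the stem-plus-inflection-table idea, here is a small Python sketch. Everything in it is an assumption made for the example: the romanized stems and suffixes are toy data and not a real grammar, the function names are hypothetical, and sound changes at morpheme boundaries (sandhi) are ignored.

```python
# Toy, romanized data for illustration only (not a real inflection table).
SUFFIXES = ("kal", "il", "ude")
STEMS = {"pusthakam", "veedu"}


def split_inflection(word, stems=STEMS, suffixes=SUFFIXES):
    """Return (stem, [suffixes]) if word is a known stem followed by
    zero or more known suffixes; otherwise return None."""
    if word in stems:
        return word, []
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            # Strip one suffix and try to explain the remainder;
            # this handles stacked (agglutinated) suffixes.
            result = split_inflection(word[:-len(suffix)], stems, suffixes)
            if result is not None:
                stem, tail = result
                return stem, tail + [suffix]
    return None


def is_correct(word):
    # A word passes the check if it decomposes into stem + suffixes.
    return split_inflection(word) is not None
```

With this toy data, `split_inflection("veedukalil")` peels off `il` and then `kal` to reach the stem `veedu`, while `is_correct("veedukil")` fails because no stem remains after stripping. A learned dictionary would replace the hard-coded `STEMS` set.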

Estimated Project Timeline

April 6 - April 21 : Learn Python, Ruby (as required)

April 21 - May 13 [Community Bonding] : Design a pseudo algorithm for the spell-checker with stemming.

May 15 - May 20 : Analyze the algorithm's effectiveness.

May 21 - June 30 : Code the algorithm.

June 23 : Mid-term evaluation submission and project progress update.

July 1 - July 31 : Test/Debug code.

Aug 1 - Aug 10 : Finalize the documentation

Aug 11 - Aug 18 : Final touches

Aug 18 - Aug 22 : Submit code to Google


Communication with Mentors

I haven't communicated with any project mentors yet.