User:Vidya


Personal Information

Email Address: vidya.vnv@gmail.com

Blog URL: http://www.quieterminal.wordpress.com

Freenode IRC Nick: wannaC

Your university and current education: Netaji Subhas Institute of Technology, Senior Year, B.E. in Information Technology

Association with SMC

Why do you want to work with the Swathanthra Malayalam Computing? I come from a Malayali family but live far away (in Delhi), where I get little exposure to my culture, so it is important for me to find ways to stay connected with it. SMC gives me a platform where I can improve my skills while getting better acquainted with Malayalam, my native language. It also gives me an opportunity to work in open source with an added incentive to learn more about the language. I would also like to be associated with an Indian organisation that collaborates with the government, which SMC does.

Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor? Submitted a patch "Update installation.rst #20" to Swathanthra Malayalam Computing.

Did you participate in past GSoC programs? If so, which years and which organisations? No, this is my first attempt.

Do you have other obligations between May and August? Please note that we expect the Summer of Code to be a full time, 40 hour a week commitment. I will have final year examinations during May, but I know how to prioritise my work and will give 100% to the project.

Will you continue contributing to/supporting the Swathanthra Malayalam Computing after the GSoC 2014 program? If yes, which area(s) are you interested in? Yes, mainly the Transliteration and SpellChecker modules.

Why should we choose you over other applicants? I have the necessary programming skills and I am dedicated to the tasks I commit to. I believe in teamwork and want to work on an open source project. I am a quick learner and can work in challenging situations. I am really excited to work on a widely used free software project.

Project Proposal

Converting the Indic processing modules currently in SILPA into a JavaScript module library involves porting all the SILPA modules written in Python to JavaScript.


PURPOSE OF THE PROJECT
Porting the algorithms to JavaScript gives us the following advantages:
a) Performance
b) Straightforwardness


IMPLEMENTATION

Algorithms to be ported:
1. Soundex

2. Transliteration

3. ApproxSearch

4. SpellChecker

5. CharDetails

6. Payyans

7. TextSimilarity

8. Indic Stemmer

9. SILPA Sort


The focus will be on Transliteration, ApproxSearch, SpellChecker and TextSimilarity.

For flexible fuzzy search on web pages:
Bitap algorithm: This algorithm finds matches within a bounded Levenshtein distance. It is fast because the match state is kept in bit masks and updated with bitwise operations, and it is most efficient when the query fits in a machine word. It works without indexing.
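As a rough illustration of the idea, here is a minimal sketch of Bitap with the Wu-Manber extension for edit distance. The function name `bitapSearch`, its parameters, and the choice to return the end index of the first match are assumptions made for this example and are not taken from SILPA. The sketch limits the pattern to 31 characters because JavaScript bitwise operators work on 32-bit integers.

```javascript
// Hypothetical helper: returns the index in `text` where the first
// approximate match of `pattern` ENDS, or -1 if there is none.
// A match may differ from the pattern by up to `maxErrors` edits
// (insertions, deletions, substitutions).
function bitapSearch(text, pattern, maxErrors) {
  const m = pattern.length;
  if (m === 0 || m > 31) {
    throw new RangeError("pattern length must be between 1 and 31");
  }

  // masks.get(c) has bit i set when pattern[i] === c.
  // Indexing is by UTF-16 code unit, which is enough for Malayalam (BMP).
  const masks = new Map();
  for (let i = 0; i < m; i++) {
    masks.set(pattern[i], (masks.get(pattern[i]) || 0) | (1 << i));
  }
  const matchBit = 1 << (m - 1);

  // state[d] has bit i set when pattern[0..i] matches a suffix of the
  // text read so far with at most d errors.
  let state = [];
  for (let d = 0; d <= maxErrors; d++) {
    state.push((1 << d) - 1); // the first d pattern characters may be skipped
  }

  for (let pos = 0; pos < text.length; pos++) {
    const mask = masks.get(text[pos]) || 0;
    const old = state;
    state = new Array(maxErrors + 1);
    state[0] = ((old[0] << 1) | 1) & mask;
    for (let d = 1; d <= maxErrors; d++) {
      state[d] =
        (((old[d] << 1) | 1) & mask) | // characters match
        old[d - 1] |                   // extra character in the text
        ((old[d - 1] << 1) | 1) |      // substituted character
        ((state[d - 1] << 1) | 1);     // character missing from the text
    }
    if (state[maxErrors] & matchBit) {
      return pos;
    }
  }
  return -1;
}
```

For example, `bitapSearch("hello world", "wurld", 1)` finds a match ending at the last character of the text, while the same call with `maxErrors` set to 0 finds nothing.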
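An alternative to Bitap is BK-trees: This approach uses an index. A tree is built over the dictionary, keyed by the distance between words, and the triangle inequality of the metric (here, Levenshtein distance) is used to prune branches during a search.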
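The BK-tree idea can be sketched as follows. This is only an illustration under assumptions: the names `levenshtein` and `BKTree` and the `add`/`search` methods are hypothetical and do not come from SILPA. Each child edge is labelled with its distance to the parent word, so a search for words within `maxDistance` of a query only needs to follow edges whose label lies between `d - maxDistance` and `d + maxDistance`, where `d` is the distance from the query to the current node.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping one row at a time.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical BK-tree: every node stores a word and a map from
// edge distance to child node.
class BKTree {
  constructor() {
    this.root = null;
  }

  add(word) {
    if (this.root === null) {
      this.root = { word, children: new Map() };
      return;
    }
    let node = this.root;
    while (true) {
      const d = levenshtein(word, node.word);
      if (d === 0) return; // word is already in the tree
      const child = node.children.get(d);
      if (!child) {
        node.children.set(d, { word, children: new Map() });
        return;
      }
      node = child;
    }
  }

  // Returns every stored word within maxDistance edits of query.
  search(query, maxDistance) {
    const results = [];
    const stack = this.root ? [this.root] : [];
    while (stack.length > 0) {
      const node = stack.pop();
      const d = levenshtein(query, node.word);
      if (d <= maxDistance) results.push(node.word);
      // Triangle inequality: only these edges can lead to a match.
      for (const [edge, child] of node.children) {
        if (edge >= d - maxDistance && edge <= d + maxDistance) {
          stack.push(child);
        }
      }
    }
    return results;
  }
}
```

A spell checker could build the tree once from its dictionary and then call `search(word, 1)` or `search(word, 2)` to collect candidate corrections.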

Currently TextSimilarity uses cosine similarity with an n-gram approach; the same can be implemented easily in Node.js.
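A minimal sketch of that approach is shown below. It assumes character n-grams with a default of n = 2; whether SILPA's TextSimilarity uses character or word n-grams is not stated here, and the function names `ngramCounts` and `cosineSimilarity` are hypothetical.

```javascript
// Counts how often each character n-gram occurs in the text.
function ngramCounts(text, n) {
  const counts = new Map();
  for (let i = 0; i + n <= text.length; i++) {
    const gram = text.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Cosine of the angle between the two n-gram count vectors:
// 1 for identical n-gram profiles, 0 when no n-gram is shared.
function cosineSimilarity(a, b, n = 2) {
  const va = ngramCounts(a, n);
  const vb = ngramCounts(b, n);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [gram, count] of va) {
    normA += count * count;
    dot += count * (vb.get(gram) || 0);
  }
  for (const count of vb.values()) {
    normB += count * count;
  }
  if (normA === 0 || normB === 0) return 0; // text shorter than n
  return dot / Math.sqrt(normA * normB);
}
```

With bigrams, "night" and "nacht" share only the bigram "ht" out of four each, so `cosineSimilarity("night", "nacht")` gives 0.25.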


References:
hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://neil.fraser.name/software/diff_match_patch/bitap.ps


RELEVANT EXPERIENCE
I am proficient in Python. I have worked with several APIs that extract large amounts of data, and the scripts I wrote are in use at InfoAssembly, the organisation where I interned. My projects include sentiment analysis of IMDB movie reviews using Python's NLTK library, conversion of dates into ISO date format, and scraping 150 websites using Scrapy. I have worked with Node.js and JavaScript to build a Chrome extension. Among databases, I have experience with MongoDB and MySQL. All my projects are on GitHub, so I am comfortable with GitHub as well. I have also used WEKA, Matlab and Octave. I used to program in C/C++ but have lately moved to Python.

Tentative Timeline

21st March - 4th April Study the proposed JavaScript module pattern thoroughly and get familiar with all the modules.

5th April - 20th April Build a JavaScript prototype of the SpellChecker module.


Community Bonding Period

21st April - 1st May Discuss the modules to be ported and brainstorm about the algorithms that could be used. Refine the objectives of the proposal.

2nd May - 18th May Get the JavaScript SpellChecker module reviewed. Test it and discuss other techniques that could be used.

19th May - 31st May Improve the ported SpellChecker module. Prepare modules for ApproxSearch, Transliteration and Payyans.

31st May - 23rd June Test extensively and discuss the improvements required.


27th June - 12th July Prepare module for Soundex.

12th July - 5th August Prepare modules for CharDetails, TextSimilarity, SILPA Sort and Indic Stemmer.

5th August - 10th August Finish porting all modules and run unit tests.

11th August - 18th August (Pencils Down) Improve documentation and write tests for each module.


POST GSOC Maintain the project and contribute to other parts of SMC.

Other Relevant Information

Tell us about something you have created
I built a Chrome extension to extract data from websites, such as hyperlinks, "About us" information, products, investor relations and other business-related information. All of this was stored in a database so that it could be accessed later.
I developed a recommendation engine for a startup using PHP and MySQL, based on item-to-item collaborative filtering.
I analysed IMDB movie reviews using sentiment analysis with Python's NLTK library, reaching an accuracy of 89%.


Have you communicated with a potential mentor? If so, who? Yes. I communicated with Santosh.


SMC Wiki link: http://wiki.smc.org.in/User:Vidya