User:Haseeb/Urdu Support to Silpa

From SMC Wiki

GSoC 2013: Urdu Support to Silpa

Overview

Silpa (Swathanthra Indian Language Processing Applications) is a web framework and a set of applications for processing Indian languages. Silpa already supports many Indic languages; this project aims to extend its functionality by adding Urdu support to the relevant existing modules during this summer.


Project Details:

Development & The Way Forward


Urdu Script:

Urdu is an Indo-Aryan language. Its script is derived from Arabic and Persian, but has been modified to suit the particular requirements of Indo-Aryan phonology, especially aspiration, retroflexion and nasalization. It is cursive in nature. The letters are of two types, connectors and non-connectors. Connectors join with the following letter in the word or syllable, while non-connectors cannot join with the following letter. However, all letters join with a preceding connector. Most letters have three joined shapes: initial when they occur at the beginning of a word, medial when they occur in the middle, and final when they occur at the end. The final unjoined shape is the same as the basic letter.
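The connector and non-connector rule described above can be sketched in a few lines of Python. This is only an illustration, not Silpa code: the name `positional_forms` and the `NON_CONNECTORS` set are assumptions made for this example, the set covers only the common non-connecting letters, and the input is assumed to contain bare letters without diacritics or hamza.

```python
# Letters that never join to the FOLLOWING letter (non-connectors).
# Assumption: this set covers the common Urdu cases only.
NON_CONNECTORS = set(
    "\u0627"  # alif
    "\u0622"  # alif-mad
    "\u062F"  # dal
    "\u0688"  # ddal
    "\u0630"  # zal
    "\u0631"  # re
    "\u0691"  # rre
    "\u0632"  # ze
    "\u0698"  # zhe
    "\u0648"  # vao
    "\u06D2"  # bari ye
)


def positional_forms(word):
    """Return a list of (letter, form) pairs for a word of bare letters."""
    forms = []
    for i, letter in enumerate(word):
        # A letter joins backwards only if the previous letter is a connector.
        joins_prev = i > 0 and word[i - 1] not in NON_CONNECTORS
        # A letter joins forwards only if it is a connector itself
        # and is not the last letter of the word.
        joins_next = i < len(word) - 1 and letter not in NON_CONNECTORS
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((letter, form))
    return forms


# Example: "kitab" (book) = kaf, te, alif, be
for letter, form in positional_forms("\u06A9\u062A\u0627\u0628"):
    print(letter, form)
```

In this example the alif takes its final shape and, being a non-connector, leaves the last letter in its basic (isolated) shape.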


Writing System:

The script is written from right to left.


Sequence of Urdu Letters:


Ref: http://en.wikipedia.org/wiki/Urdu_alphabet


Vowels:

The long vowels in Urdu are indicated by alif ( ا ), alif-mad ( آ ), vāo ( و ), choṭī yē ( ی ) and baṛī yē ( ے ). The superscript mad ( ٓ ) written over alif, e.g., آ , denotes long /ā/ at the beginning of a word. In medial and final position, however, alif ( ا ) by itself stands for long /ā/. Choṭī yē ( ی ) and vāo ( و ), when occurring initially, stand for the semi-vowels /y/ and /v/ respectively, as in /yahā̃/ ( یہاں ) and /vahā̃/ ( وہاں ). In other environments vāo ( و ), choṭī yē ( ی ) and baṛī yē ( ے ) denote long vowels.


Short Vowels:

The short vowels in Urdu are indicated by superscript or subscript marks, as shown below:

The mark above a consonant called 'zabar' denotes a following /a/: ◌َ

The mark below a consonant called 'zēr' denotes a following /i/: ◌ِ

The mark above a consonant called 'pēsh' denotes a following /u/: ◌ُ
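As a rough illustration of how these three marks might be handled in code, the sketch below maps each mark to its sound and strips the marks from a string, which could be useful as a normalization step in modules such as search. The function names are hypothetical and are not part of Silpa; only the three marks listed above are treated, and other Arabic-script diacritics are deliberately left alone.

```python
ZABAR = "\u064E"  # Unicode name: ARABIC FATHA
ZER = "\u0650"    # Unicode name: ARABIC KASRA
PESH = "\u064F"   # Unicode name: ARABIC DAMMA

# Sound that each short-vowel mark adds after its consonant.
SHORT_VOWEL_SOUND = {ZABAR: "a", ZER: "i", PESH: "u"}


def short_vowels_in(text):
    """List the short-vowel sounds marked in text, in order of appearance."""
    return [SHORT_VOWEL_SOUND[ch] for ch in text if ch in SHORT_VOWEL_SOUND]


def strip_short_vowels(text):
    """Return text with zabar, zer and pesh removed."""
    return "".join(ch for ch in text if ch not in SHORT_VOWEL_SOUND)


# "kitab" written with a zer under the first letter
marked = "\u06A9\u0650\u062A\u0627\u0628"
print(short_vowels_in(marked))
print(strip_short_vowels(marked) == "\u06A9\u062A\u0627\u0628")
```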


Modules to which I will be adding Urdu support:

  • Transliteration - I have already started working on this (Urdu Transliteration)
  • Guess Language
  • Dictionary - I will be using the Wiktionary dump
  • Spell Checker
  • Syllabification
  • Soundex
  • Approximate search
  • Shingling Library
  • Fortune Cookies - I will be adding Urdu shayari (poetry) by famous Urdu poets.
  • Hyphenator
  • Ngram
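To give a flavour of what Urdu support in a module like Guess Language could involve, here is a minimal heuristic sketch. It is not the algorithm Silpa uses: the function names and the choice of "Urdu-specific" letters are assumptions made for this example. The idea is that text in Arabic script which also contains letters such as ٹ, ڈ, ڑ, ں, ھ, ہ or ے is more likely to be Urdu than Arabic or Persian.

```python
# Letters common in Urdu but absent from standard Arabic and Persian
# orthography (an assumption for this sketch; some occur in other languages).
URDU_SPECIFIC = set(
    "\u0679"  # tte
    "\u0688"  # ddal
    "\u0691"  # rre
    "\u06BA"  # noon ghunna
    "\u06BE"  # do-chashmi he
    "\u06C1"  # gol he
    "\u06D2"  # bari ye
)


def is_arabic_script(ch):
    """True if ch lies in the basic Unicode Arabic block (U+0600..U+06FF)."""
    return "\u0600" <= ch <= "\u06FF"


def looks_like_urdu(text):
    """Heuristic: mostly Arabic-script letters, at least one Urdu-specific."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic_count = sum(1 for ch in letters if is_arabic_script(ch))
    if arabic_count * 2 < len(letters):
        return False
    return any(ch in URDU_SPECIFIC for ch in letters)


print(looks_like_urdu("\u06CC\u06C1\u0627\u06BA"))  # "yahan" (here)
print(looks_like_urdu("\u0643\u062A\u0627\u0628"))  # Arabic "kitab"
```

A real implementation would need statistics over longer text, since short Urdu words may contain none of these letters.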

Test Suite

Test suites will be written for each module using Python's standard "unittest" framework. Currently the modules have no test suite.
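A per-module test suite could look roughly like the sketch below. The function `char_ngrams` is a stand-in written for this example, not the actual Silpa Ngram module API; the point is only to show the shape of a "unittest" test case operating on Urdu text.

```python
import unittest


def char_ngrams(word, n):
    """Hypothetical helper: return the character n-grams of word."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [word[i:i + n] for i in range(len(word) - n + 1)]


class UrduNgramTest(unittest.TestCase):
    WORD = "\u0627\u0631\u062F\u0648"  # "Urdu": alif, re, dal, vao

    def test_bigrams(self):
        expected = ["\u0627\u0631", "\u0631\u062F", "\u062F\u0648"]
        self.assertEqual(char_ngrams(self.WORD, 2), expected)

    def test_n_longer_than_word(self):
        self.assertEqual(char_ngrams(self.WORD, 10), [])

    def test_invalid_n(self):
        with self.assertRaises(ValueError):
            char_ngrams(self.WORD, 0)


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UrduNgramTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` runs the three tests; in a real module the file would more typically be run with `python -m unittest`.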

Benefits:

  • It will broaden the reach of the Silpa project.
  • It will make Silpa more popular among Urdu users, since little Urdu software is available.
  • It should motivate organizations to use Silpa, since it furthers their aim of communication and collaboration across Indic languages, adding one more language to Silpa's already strong support.
  • It will enable Silpa to extend its collaborative nature to the next level, especially in schools, colleges and universities.


Roadmap:

  • Learning: already started and ongoing
  • Programming: 8 weeks
  • Test suite: 1 week
  • Final review, buffer time and bug fixing: 1 week
  • Documentation: 1 week


Methodology


My development process will follow the standard Silpa development process, under the guidance of my mentor. Each module will be developed in a branch. When the code and matching unit tests are finished, they will go through code review by the mentor to ensure that the code follows the coding standard and is well designed, sufficiently tested and documented. Once the issues from the code review have been addressed, the branch will be merged into trunk.

Small, frequent feedback via code review, together with the requirement for testing and documentation, will ensure that I am learning and improving throughout the summer, and will enhance my ability to get code merged into Silpa from the very beginning.


How I came up with this


During a discussion with Vasudev Kamath (copyninja_), he suggested that I propose this. Originally I proposed the idea of a "word thesaurus" for Indic languages, but I had to drop it because the required data was unavailable.


Motivation


I am a FOSS enthusiast, and in my free time I contribute translations to a couple of open source projects. SMC is a great initiative for Indic languages and I like the community. This project will open new doors for Urdu users.


Level of Difficulty


Medium


Potential Mentor


Vasudev Kamath (copyninja_)


Why me?


I like coding in Python, which is the language Silpa is developed in, and I have some scripts, mostly using Flask, on Bitbucket. I am a native speaker of Urdu and can read, write and speak it well. I also know Hindi, Kannada and a little Telugu, and I know Arabic (I can read it but can't speak it :) ). I am familiar with the Silpa source code and have already started working on the Transliteration module. In my free time I also contribute translations to a couple of open source projects. I am fairly good with Git and Mercurial.

Timeline


It's difficult to plan exactly how the work will proceed. I have semester exams for my bachelor's degree in the last week of June and the second week of July, so I would like to start working during the Community Bonding period.

May 27 - June 17 = Community Bonding + Transliteration + Guess Language with Test suites.

June 18 - June 30 = Hyphenator with Test suite.

July 1 - July 4 = Fortune Cookies.

July 5 - July 9 = I have exams.

July 10 - July 15 = Continuation of Fortune Cookies, with Test suite.

July 16 - July 30 = Spell checker and Dictionary with Test suites.


July 31 - August 2 = Midterm time. Submit the code. I'll keep this time free in case I've fallen behind schedule.


Aug 3 - Aug 18 = Soundex and Approx Search with Test suites.

Aug 19 - Aug 30 = Shingling Library and Ngram with Test suites.

Aug 31 - Sept 10 = Syllabification with Test suite.

Sept 11 - Sept 21 = Buffer time, Documentation, Final review, Extensive testing to prevent as many bugs as possible.

Sept 22 - Sept 23 = Final Submission.

Personal information


Email Address: abdulraufhaseeb@gmail.com

Blog URL: http://terrificking.blogspot.in/

Freenode IRC Nick: haseeb

University and current education: My university is Visvesvaraya Technological University, and I am currently pursuing a Bachelor of Engineering in Electronics and Communication at K.B.N College of Engineering, Gulbarga.

Past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor: I have not contributed to SMC formally, but I have made some minor contributions: I created requirements.txt and modules.txt in the Silpa source tree for easier installation of Silpa, and did one code cleanup.

I am an active member of the open source community in my college and city. I have taken part in many open source meetups and events (mostly in Bangalore). I have also volunteered at many events, such as HasGeek's events and PyCon India. This time I handled PyCon India's registration, both online and onsite. I am also the founder of the local Linux user group in my town (ilug-gulbarga).

Did you participate in past GSoC programs? If so, which years and which organizations?

No, I have not participated before.

OS and Editor: I use Arch Linux with GNOME as my operating system. I like the Sublime Text 2 editor, especially features such as code completion and color themes.

Will you continue contributing/ supporting the Swathanthra Malayalam Computing after the GSoC 2013 program, if yes, which area(s), you are interested in?

Yes, I will continue to work with SMC and will maintain Urdu-related tasks. I also have a future goal: a "Word Thesaurus" for Indic languages :).