GSoC/2014/Project ideas: Difference between revisions

From SMC Wiki

Revision as of 19:28, 25 February 2014

Apart from the following ideas, you can propose your own ideas.

Potential Mentors

  1. Santhosh Thottingal (santhosh on irc.freenode.net)
  2. Baiju M (baijum on irc.freenode.net)
  3. Praveen A (j4v4m4n on irc.freenode.net)
  4. Rajeesh K Nambiar (rajeeshknambiar on irc.freenode.net)
  5. Vasudev Kammath (copyninja on irc.freenode.net)
  6. Jishnu Mohan (jishnu7 on irc.freenode.net)
  7. Hrishikesh K.B (stultus on irc.freenode.net)
  8. Anivar Aravind (anivar on irc.freenode.net)
  9. Anilkumar K V (anilkumar on irc.freenode.net)
  10. Sajjad Anwar (geohacker on irc.freenode.net)
  11. Deepa V Gopinath (deepagopinath on irc.freenode.net)
  12. Jain Basil (jainbasil on irc.freenode.net)
  13. Ershad K (ershad on irc.freenode.net)
  14. Navaneeth (nkn__ on irc.freenode.net)
  15. Nishan Naseer (nishan on irc.freenode.net)
  16. Nandaja Varma (gem on irc.freenode.net)

Ideas for Google Summer of Code 2014

  • Please Read the FAQ


If you want to propose an idea, please do so on the student projects mailing list (http://lists.smc.org.in/listinfo.cgi/student-projects-smc.org.in).


Projects with confirmed mentors

A spell checker for Indic language that understands inflections

Project:

The SILPA project has a spellchecker written in Python that uses a fairly involved algorithm. Even so, it cannot handle the inflection and agglutination found in Indian languages, especially South Indian languages. The dictionary for the Malayalam spellchecker has about 150,000 words. We could expand it, but that adds little value, because words in Malayalam, Tamil and similar languages can be formed by joining multiple words. In addition, words are inflected according to grammatical forms (sandhi), number, gender and so on. Hunspell has a system to handle this, but so far nobody has succeeded in getting it to work for the multi-level suffix stripping that Malayalam requires. Sometimes a Malayalam word is formed by joining more than five words. We will need word-splitting logic or a table that covers all the patterns. The project is to attempt to solve this with Hunspell and, if that is not feasible (Hunspell upstream is not active), to develop an algorithm and implement it.

Recently, the Tamil community attempted to develop a spellchecker using Hunspell with multi-level suffix stripping; the result is at https://github.com/thamizha/solthiruthi. Our first attempt should be to use Hunspell to achieve spellchecking with agglutination and inflection. This will probably require a lot of scripting to generate suffix patterns, and we can ask existing language communities for help. If Hunspell turns out to be limited in handling multi-level suffixes (Indian languages sometimes require more than five levels of suffix stripping), we need to document the limitation (bug report and documentation) and attempt a Python-based solution on top of the SILPA framework.

  • Savannah Task
  • Expertise required: Average level understanding of grammar system of at least one Indian language
  • Complexity: Advanced
  • Mentor: Santhosh Thottingal
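
To make "multi-level suffix stripping" concrete, here is a minimal Python sketch of the idea. It is not the SILPA or Hunspell algorithm. The function name `strip_suffixes`, the depth limit and the toy Latin-script stems and suffixes are all assumptions made for illustration. Real Malayalam data would also need sandhi rules, because joined words often change at the boundary instead of simply concatenating.

```python
def strip_suffixes(word, stems, suffixes, max_depth=6):
    """Return every (stem, [suffixes in surface order]) analysis of `word`.

    `stems` is a set of known dictionary words and `suffixes` a list of
    known suffix strings; both are hypothetical toy data here.
    """
    results = []

    def walk(rest, stripped, depth):
        # A known stem ends one successful analysis.
        if rest in stems:
            results.append((rest, list(reversed(stripped))))
        if depth == max_depth:
            return
        # Try peeling one more suffix off the right-hand end.
        for suffix in suffixes:
            if len(rest) > len(suffix) and rest.endswith(suffix):
                walk(rest[: -len(suffix)], stripped + [suffix], depth + 1)

    walk(word, [], 0)
    return results


# Toy data standing in for real stems and suffixes.
analyses = strip_suffixes("helpfulness", {"help"}, ["ful", "ness", "less"])
print(analyses)
```

A word would count as correctly spelled when at least one analysis is found. The depth limit reflects the "more than five levels" mentioned above and would need tuning against real data.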


Indic rendering support in ConTeXt

Project:

ConTeXt is another TeX macro system, similar to LaTeX but much more suitable for design. For more information about ConTeXt, see the wiki at http://wiki.contextgarden.net/Main_Page. ConTeXt MKII has Indic language rendering support through XeTeX, but MKII is deprecated, and the new MKIV backend does not support Indic rendering yet. The aim of this project is to add Indic rendering support to ConTeXt MKIV. XeTeX uses HarfBuzz to render Indic text correctly.

  • More Details: A partially working patch by Rajeesh for the MKIV Lua code is available. ConTeXt MKII (deprecated) can work with the XeTeX backend for Indic rendering. Here is a sample file:
\usemodule[simplefonts]
\definefontfeature[malayalam][script=mlym]
\setmainfont[Rachana][features=malayalam]
\starttext
മലയാളം \TeX ഉപയോഗിച്ച് ടൈപ്പ്സെറ്റ് ചെയ്തത്
\stoptext

Generate the output using the command:

texexec --xetex <file.tex>
  • Savannah Task
  • Expertise required: Understanding of the TeX system, experience with either LaTeX or ConTeXt, and a basic understanding of Indic language rendering. MKIV uses Lua, so familiarity with Lua, the OpenType specifications or HarfBuzz will be an added advantage.
  • Complexity: Advanced
  • Mentor: Rajeesh K Nambiar

Language model and Acoustic model for Malayalam language for speech recognition system in CMU Sphinx

Project:

CMU Sphinx is a large-vocabulary, speaker-independent speech recognition codebase and suite of tools that can be used to develop a speech recognition system for any language. To develop an automatic speech recognition (ASR) system for a language, an acoustic model and a language model have to be built for that language. The acoustic model characterizes how sound changes over time and captures the characteristics of the basic recognition units. The language model describes the likelihood, probability, or penalty assigned when a sequence or collection of words is seen. It attempts to convey the behavior of the language and to predict the occurrence of specific word sequences that are possible in it. Once these two models are developed, they will be useful to everyone doing research in speech processing. ASR systems using the Sphinx engine have already been developed for the Indian languages Hindi, Tamil, Telugu and Marathi. This project aims to develop an acoustic model and a language model for Malayalam.
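
As a rough illustration of what the language model described above estimates, the following Python sketch computes unsmoothed bigram probabilities from a tiny corpus. It is only a sketch. The function names and the three-sentence toy corpus are assumptions, and a model used with CMU Sphinx would be trained on a large Malayalam text corpus with proper smoothing using the project's own tooling.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and word pairs over whitespace-tokenized sentences."""
    contexts, pairs = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        pairs.update(zip(tokens, tokens[1:]))    # adjacent word pairs
    return contexts, pairs


def bigram_prob(contexts, pairs, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if contexts[prev] == 0:
        return 0.0
    return pairs[(prev, word)] / contexts[prev]


# Toy corpus; the letters stand in for real words.
contexts, pairs = train_bigram(["a b", "a c", "a b"])
print(bigram_prob(contexts, pairs, "a", "b"))
```

In this toy corpus "a" is followed by "b" in two of its three occurrences, so the estimate for "b" after "a" is two thirds.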

  • Mentor: Deepa P. Gopinath
  • Background Reading
      ◦ "Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems", Gopalakrishna Anumanchipalli, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh, R.N.V. Sitaram, S P Kishore: http://www.cs.cmu.edu/~gopalakr/publications/spdatabases_specom05.pdf
      ◦ "Automatic Pronunciation Evaluation And Mispronunciation Detection Using CMUSphinx", Ronanki Srikanth, James Salsman: http://www.aclweb.org/anthology/W/W12/W12-5808.pdf
      ◦ http://www.speech.cs.cmu.edu/
      ◦ http://cmusphinx.sourceforge.net/wiki/tutorial
      ◦ "HTK Based Telugu Speech Recognition", P. Vijai Bhaskar, S. Rama Mohan Rao, A. Gopi: http://www.ijarcsse.com
      ◦ "Design and Development of an Automatic Speech Recognition System for Urdu", Agha Ali Raza, M.Sc. Thesis, FAST-National University of Computer and Emerging Sciences: http://www.cs.cmu.edu/~araza/Automatic_Speech_Recognition_System_for_Urdu.PDF
      ◦ "Investigation Arabic Speech Recognition Using CMU Sphinx System", Hassan Satori, Hussein Hiyassat, Mostafa Harti and Noureddine Chenfour: http://www.ccis2k.org/iajit/PDF/vol.6,no.2/11IASRUCSS186.pdf
      ◦ "Understanding the CMU Sphinx Speech Recognition System", Chun-Feng Liao: http://www.try.idv.tw/static-resources/homework/pr/PR_Final_Report.pdf


Silpa based

Provide REST API for new flask based Silpa, including conversion of templates to this REST API from JSON RPC

Project:

Silpa currently relies on JSON-RPC. We need either to move completely to a REST API or to provide a REST API as an additional feature.

  • Expertise required: Python, Flask, Jinja, HTML, JavaScript
  • Mentor: Vasudev/Jishnu


Android SDK for Silpa

Project:

Port suitable Silpa modules to Java and create an SDK so that other developers can use them in their apps. Modules such as Indic Render, Transliteration and Payyans have real potential on Android because of platform fragmentation and the lack of proper Indic support. This SDK will help developers support their Indic apps on a wide range of Android devices.

  • Expertise required: Java, Android, Python
  • Mentor: Jishnu/Hrishikesh/Aashik


Converting indic processing modules currently in SILPA into javascript modules library

Project:

Port some of the Silpa algorithms to Node modules. Several modules and algorithms in the SILPA project are currently written in Python, and porting them to JavaScript would help developers. For example, cross-language transliteration can be done in JavaScript too if we port the algorithm and the transliteration rules. Similarly, the approximate search can be ported. A flexible fuzzy search on web pages will be possible once the algorithm is available in JavaScript.

The proposed JavaScript module pattern is https://github.com/umdjs/umd

Student proposals should include a list of the algorithms they plan to port, planned demo applications, planned documentation, and publishing details (for example, the npm registry).

  • Expertise required: JavaScript, Python
  • Mentor: Jishnu


Improving cross language transliteration system.

Project:

Currently only Kannada and Malayalam are handled well. All other languages are first converted to Malayalam and then to English, because language-specific support for them is lacking. For English to Indic we currently use CMUDict, so transliteration is limited to words present in CMUDict; we could probably develop a better method for English-to-Indic transliteration.

CLDR has transliteration data for Indic languages. We can explore it and assess its feasibility. For an intermediate representation of the scripts, either IPA or the ISO 15919 standard can be used. All of these must be supplemented with exception rules and special-case handling to achieve more accurate results.

  • Expertise required: Python
  • Mentor: Vasudev/Jishnu
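
The intermediate-representation idea can be sketched as follows: each source script is mapped to ISO 15919 units, and each target script is mapped back from those units, with an exception table consulted first. This is a minimal sketch, not SILPA's transliteration module. The three-letter tables, the function names and the exception table are hypothetical, and real tables would also have to cover vowel signs, viramas and conjuncts.

```python
# Tiny illustrative tables: Malayalam letters to ISO 15919 units,
# and ISO 15919 units to Kannada letters (inherent vowel "a" included).
ML_TO_ISO = {"ക": "ka", "മ": "ma", "ല": "la"}
ISO_TO_KN = {"ka": "ಕ", "ma": "ಮ", "la": "ಲ"}


def to_pivot(text, table):
    """Map each source character to its pivot unit; unknown characters pass through."""
    return [table.get(ch, ch) for ch in text]


def from_pivot(units, table):
    """Map each pivot unit to the target script; unknown units pass through."""
    return "".join(table.get(unit, unit) for unit in units)


def transliterate(text, src_table, dst_table, exceptions=None):
    """Transliterate via the pivot, checking whole-word exceptions first."""
    if exceptions and text in exceptions:
        return exceptions[text]
    return from_pivot(to_pivot(text, src_table), dst_table)


print(transliterate("കമല", ML_TO_ISO, ISO_TO_KN))
```

With a shared pivot, adding a language needs one table to the pivot and one from it, instead of one table for every language pair.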


Internationalize SILPA project with Wikimedia jquery projects, improve the webfonts module in Silpa using jquery.webfonts and provide more Indic and complex fonts as part of it

Project:

Internationalize SILPA: The SILPA project has many Indic language applications, but at present it has no built-in tool for typing in Indian languages. The application is also not internationalized. Both can be achieved with the jquery.ime and jquery.i18n libraries from Wikimedia. A sample implementation is available on our website. The i18n should be integrated into the SILPA Flask framework with a good templating system. Similarly, the interface should use webfonts through the jquery.webfonts library.

Improve the webfonts:

  • Silpa currently provides 36 webfonts. Add more fonts to this collection.
  • Rewrite the webfonts module to use the features of jquery.webfonts
  • Create a repo as per the jquery.webfonts specification
  • Provide a clean API so that other websites can use our webfonts
  • Document the usage
  • Provide font preview and download options
  • This is partly done.

  • More Details
      ◦ jquery.i18n: https://github.com/wikimedia/jquery.i18n
      ◦ jquery.ime: https://github.com/wikimedia/jquery.ime
      ◦ jquery.webfonts: https://github.com/wikimedia/jquery.webfonts
  • Expertise required: jQuery, CSS, HTML5, Python, Flask, technical understanding of fonts
  • Mentor: Jishnu/Vasudev



Language filter for diaspora

Diaspora is a Free Software, federated social networking platform. Diaspora users post in many languages. When people use more than one language in their posts, it is inconvenient for people who don't understand a language. This task is to tag every post with languages used in the post, ideally detected automatically, but with an option to override it. Once each post has a language tag, people should be able to choose their preferred language and posts in other languages should be hidden by default. Also provide an option to translate posts and comments.

  • Expertise required: Ruby on Rails
  • Mentor: Pirate Praveen, Ershad K
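
One simple way to tag posts automatically is to look at the Unicode blocks their characters fall in. The Python sketch below illustrates only that idea; Diaspora itself is a Ruby on Rails application, and the function names, the post dictionaries and the `tags` override field are assumptions. Script detection also cannot tell apart languages that share a script (for example Hindi and Marathi in Devanagari), so a real implementation would need a proper language detector.

```python
# Inclusive Unicode block ranges for a few scripts.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Tamil": (0x0B80, 0x0BFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def detect_scripts(text):
    """Return the set of script names found in `text`."""
    found = set()
    for ch in text:
        if ch.isascii() and ch.isalpha():
            found.add("Latin")
            continue
        code = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code <= high:
                found.add(name)
                break
    return found


def visible_posts(posts, preferred):
    """Keep posts whose tags overlap the reader's preferred set.

    A post's own "tags" entry (the manual override) wins over detection;
    posts with no detectable script are always shown.
    """
    visible = []
    for post in posts:
        tags = post.get("tags") or detect_scripts(post["text"])
        if not tags or tags & preferred:
            visible.append(post)
    return visible


print(detect_scripts("hello മലയാളം"))
```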


Varnam Based

Varnam is a cross-platform predictive transliterator for Indian languages. It works mostly like Google's transliterate, but shows key differences in the way word tokenization is done. It has a learning system built in which allows Varnam to make smart predictions.

Varnam clients are available as Firefox and Chrome add-ons and as an IBus engine.

To try out Varnam, go to http://varnamproject.com/editor. It currently supports Hindi and Malayalam.


Improvements to the REST API

Project:

This includes rewriting the current implementation in `golang` and adding WebSocket support to improve the input experience. It also includes writing scripts that make it easy to embed input on any web page. All changes will go live on [1].

  • Expertise required: Basic understanding of golang and C
  • Complexity: Advanced
  • Mentor: Navaneeth S


Improve the learning system

Project:

The main goal is to improve how Varnam tokenizes words when learning them. Today, when a word is learned, Varnam takes all possible prefixes into account and learns all of them to improve future suggestions. Sometimes this is not enough to produce good suggestions. The suggested improvement is to try to infer the base form of the word being learned.

  • Expertise required: Knowledge in C
  • Complexity: Medium
  • Mentor: Navaneeth S


Word corpus synchronization

Project:

Create a cross-platform synchronization tool that can upload and download the word corpus from offline IMEs such as varnam-ibus [2]. This makes it easier to build the online word corpus.

  • Expertise required: Knowledge in C/golang
  • Complexity: Medium
  • Mentor: Navaneeth S

  • [1]: http://www.varnamproject.com
  • [2]: https://gitorious.org/varnamproject/libvarnam-ibus/


Adding Braille Keyboard layouts for Indian Languages to m17n Library

Project:

The project is to build support for Bharati Braille keyboard layouts on GNU/Linux systems. Bharati Braille is the official Braille standard in India. A regular QWERTY keyboard is used for data entry, with the S, D, F, J, K and L keys standing for the six dots of Braille. This support needs to be built as m17n layouts. It will enable visually challenged people who have learned the Braille layouts to use GNU/Linux systems easily with the help of audio feedback from TTS.
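
The six-key entry described above can be sketched in Python as a mapping from a chord of keys to a Unicode Braille pattern. The key-to-dot assignment used here (F, D, S for dots 1 to 3 and J, K, L for dots 4 to 6) is an assumption based on the usual six-key Braille typing convention, and the function name is hypothetical. The Bharati Braille tables that turn each cell into an Indic character are not shown, and an actual layout would be written in m17n's own input method format, not in Python.

```python
# Assumed key-to-dot assignment (not specified in the project description).
KEY_TO_DOT = {"f": 1, "d": 2, "s": 3, "j": 4, "k": 5, "l": 6}

BRAILLE_BASE = 0x2800  # start of the Unicode Braille Patterns block


def chord_to_cell(keys):
    """Turn simultaneously pressed keys into one Unicode Braille cell.

    In the Braille Patterns block, dot n corresponds to bit n - 1
    of the offset from U+2800.
    """
    mask = 0
    for key in keys:
        dot = KEY_TO_DOT[key.lower()]
        mask |= 1 << (dot - 1)
    return chr(BRAILLE_BASE + mask)


print(chord_to_cell("f"), chord_to_cell("fd"), chord_to_cell("fj"))
```

A second table, taken from the Bharati Braille standard, would then map each cell (or sequence of cells) to the corresponding character of the selected Indian language.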

  • More Details
      ◦ http://www.acharya.gen.in:8080/disabilities/bh_brl.php
      ◦ http://en.wikipedia.org/wiki/Bharati_Braille
      ◦ http://www.nongnu.org/m17n/
  • Mentor: Anivar Aravind


Projects with unconfirmed mentors

Grandham

Adding MARC21 import/export feature in Grandham

We need a feature in Grandham to import and parse data from MARC21 documents. We should also be able to export existing data in MARC21.

Expertise required: Knowledge in Ruby/Ruby on Rails

Complexity : High