GSoC/2016/Project ideas: Difference between revisions

From SMC Wiki
Latest revision as of 13:29, 22 March 2016

Apart from the following ideas, you can propose your own ideas.

Potential Mentors

  1. Santhosh Thottingal (santhosh on irc.freenode.net)
  2. Sayamindu Das Gupta (unmadindu on irc.freenode.net)
  3. Rajeesh K Nambiar (rajeesh on irc.freenode.net)
  4. Vasudev Kammath (copyninja on irc.freenode.net)
  5. Jishnu Mohan (jishnu7 on irc.freenode.net)
  6. Navaneeth (nkn__ on irc.freenode.net)
  7. Samuel Thibault (youpi on irc.freenode.net)
  8. Anivar Aravind (anivar on irc.freenode.net)
  9. Hrishikesh K.B (stultus on irc.freenode.net)
  10. Deepa Gopinath (deepagopinath on irc.freenode.net)


Ideas for Google Summer of Code 2016

Projects with confirmed mentors

Indic Keyboard

Projectː Better user onboarding / first-time user experience.

Details: The setup wizard should include tutorials and other information.

Complexity: Easy

Confirmed Mentor: Jishnu Mohan

How to contact the mentor: IRC - jishnu7 on #smc-project or #silpa on Freenode

Expertise required: Java / Android. Fix at least one bug marked as beginner/easy in https://gitlab.com/smc/indic-keyboard/issues or https://github.com/smc/Indic-Keyboard/issues


Projectː Convert the keyboard into an SDK.

Details: An SDK will help third-party developers bundle IndicKeyboard in their apps.

Use case: Apps targeted at an Indian audience can accept text input without asking the user to install and configure Indic Keyboard. The Libindic SDK provides a similar feature, but it only has transliteration. Indic Keyboard provides full-featured layouts like Inscript and Phonetic.

Complexity: Difficult

Confirmed Mentor: Jishnu Mohan

How to contact the mentor: IRC - jishnu7 on #smc-project or #silpa on Freenode

Expertise required: Java / Android. Fix at least one bug marked as beginner/easy in https://gitlab.com/smc/indic-keyboard/issues or https://github.com/smc/Indic-Keyboard/issues

libindic - Android

Projectː Merge the transliteration module and android-ime of IndicKeyboard

Details: The transliteration module we use in IndicKeyboard is significantly better and well tested. We should use it instead of maintaining a second library.

Complexity: Easy

Confirmed Mentor: Jishnu Mohan

How to contact the mentor: IRC - jishnu7 on #smc-project or #silpa on Freenode

Expertise required: Java / Android


Projectː Finish the demo app and release it on the Google Play store

Details: We need to fix the current issues and update the demo app to attract developers.

Complexity: Easy

Confirmed Mentor: Jishnu Mohan

How to contact the mentor: IRC - jishnu7 on #smc-project or #silpa on Freenode

Expertise required: Java / Android


ibus-braille module modifications

Projectː

This project builds on work that was successfully completed by a student under SMC. The remaining tasks areː

  1. Integrate ibus-braille with Liblouis
  2. Create a table editor for Liblouis
  3. Create a web version and host it
  4. Add more Indian languages to Liblouis
  5. Add a facility to write braille Unicode characters directly
  6. Remove the eSpeak dependency and make it accessible via Orca itself
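
As an illustration of task 5, the sketch below maps raised dot numbers to characters in the Unicode Braille Patterns block, which starts at U+2800 and stores dot n in bit n-1. The function names are hypothetical and are not part of ibus-braille or Liblouis.

```python
BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block


def dots_to_braille(dots):
    """Return the braille character for an iterable of raised dot numbers (1-8)."""
    code_point = BRAILLE_BASE
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        code_point |= 1 << (dot - 1)  # dot n is stored in bit n-1
    return chr(code_point)


def braille_to_dots(char):
    """Return the sorted list of raised dot numbers for a braille character."""
    bits = ord(char) - BRAILLE_BASE
    if not 0 <= bits <= 0xFF:
        raise ValueError("not a Unicode braille pattern character")
    return [dot for dot in range(1, 9) if bits & (1 << (dot - 1))]


if __name__ == "__main__":
    cell = dots_to_braille([1, 2, 5])
    print(cell, braille_to_dots(cell))
```

Because the mapping is a plain bit mask, a key chord pressed on the keyboard could be turned into a braille character without consulting any language table.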

Complexity:

Confirmed Mentor: Samuel Thibault

How to contact the mentor: IRC - youpi on #smc-project on Freenode

Expertise required:

What the student will learn

Varnam based

Varnam is a cross-platform predictive transliterator for Indian languages. It works much like Google's transliteration tool, but differs in how word tokenization is done. A built-in learning system allows Varnam to make smart predictions.

Varnam clients are available as Firefox and Chrome add-ons and as an IBus engine.

To try out Varnam, go to http://varnamproject.com/editor.

Add Varnam support into Indic Keyboard

Project:

As part of this project, students can add support for Varnam to IndicKeyboard. This involves roughly the following steps:

  1. Compiling libvarnam for Android
  2. Writing JNI wrappers for the libvarnam library
  3. Hooking up varnamd on Android to do the word corpus synchronization
  4. Adding Varnam support to IndicKeyboard

Before submitting proposals:

  1. Ensure you can program in C, Java and Go
  2. Ensure you have the Varnam libraries compiled on your local machine
  3. Ensure you have IndicKeyboard set up on your local machine


Complexity: Advanced

Confirmed Mentor: Navaneeth K. N.

How to contact the mentor: IRC - nkn__ on #smc-project on Freenode

Expertise required: C, Java, golang, Android

What the student will learn: How to write an input system for Android


Reference:

  1. Varnam
  2. IndicKeyboard Playstore | Github

Language model and Acoustic model for Malayalam language for speech recognition system in CMU Sphinx

Project:

CMU Sphinx is a large-vocabulary, speaker-independent speech recognition codebase and suite of tools that can be used to develop a speech recognition system in any language. Developing an automatic speech recognition (ASR) system for a language requires an acoustic model and a language model for that language. The acoustic model characterizes how sound changes over time and captures the characteristics of the basic recognition units. The language model describes the likelihood, probability, or penalty assigned to a sequence or collection of words; it tries to capture the behavior of the language and predict which word sequences are likely to occur. Once these two models are developed, they will be useful to everyone doing research in speech processing. ASR systems using the Sphinx engine have already been developed for the Indian languages Hindi, Tamil, Telugu and Marathi. This project aims to develop an acoustic model and a language model for Malayalam.
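
To make the idea of a language model concrete, here is a minimal sketch of a bigram model that estimates the probability of a word given the previous word from raw counts. It is a toy illustration in Python, not the CMU Sphinx tooling; the function names and the tiny English corpus are placeholders, and a real Malayalam model would need a large corpus and smoothing.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count word histories and word pairs over whitespace-tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # <s> and </s> mark the sentence boundaries
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for first, second in zip(tokens, tokens[1:]):
            history_counts[first] += 1
            bigram_counts[(first, second)] += 1
    return history_counts, bigram_counts


def bigram_probability(model, first, second):
    """Maximum-likelihood estimate of P(second | first); 0.0 for unseen histories."""
    history_counts, bigram_counts = model
    if history_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / history_counts[first]


if __name__ == "__main__":
    corpus = ["open the door", "close the door", "open the window"]
    model = train_bigram_model(corpus)
    print(bigram_probability(model, "the", "door"))
```

With this corpus, "door" follows "the" in two of the three occurrences of "the", so the model prefers "the door" over "the window" when the recognizer has to choose between acoustically similar candidates.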

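Background Reading

  * "Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems", Gopalakrishna Anumanchipalli, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh, R.N.V. Sitaram, S P Kishore: http://www.cs.cmu.edu/~gopalakr/publications/spdatabases_specom05.pdf
  * "Automatic Pronunciation Evaluation And Mispronunciation Detection Using CMUSphinx", Ronanki Srikanth, James Salsman: http://www.aclweb.org/anthology/W/W12/W12-5808.pdf
  * http://www.speech.cs.cmu.edu/
  * http://cmusphinx.sourceforge.net/wiki/tutorial
  * "HTK Based Telugu Speech Recognition", P. Vijai Bhaskar, AVNIET, Hyderabad, Prof. Dr. S. Rama Mohan Rao, A. Gopi: http://www.ijarcsse.com
  * "Design and Development of an Automatic Speech Recognition System for Urdu", Agha Ali Raza, M.Sc. Thesis, FAST-National University of Computer and Emerging Sciences: http://www.cs.cmu.edu/~araza/Automatic_Speech_Recognition_System_for_Urdu.PDF
  * "Investigation Arabic Speech Recognition Using CMU Sphinx System", Hassan Satori, Hussein Hiyassat, Mostafa Harti, and Noureddine Chenfour: http://www.ccis2k.org/iajit/PDF/vol.6,no.2/11IASRUCSS186.pdf
  * "Understanding the CMU Sphinx Speech Recognition System", Chun-Feng Liao: http://www.try.idv.tw/static-resources/homework/pr/PR_Final_Report.pdf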

Complexity :

Mentor : Deepa Gopinath

How to contact the mentor:

Expertise required:

What the students will learn:

Projects with unconfirmed mentors

A spell checker for Indic language that understands inflections

Project:

The libindic project has a spellchecker written in Python that uses a fairly involved algorithm, but it still cannot handle the inflection and agglutination found in Indian languages, especially South Indian languages. The dictionary for the Malayalam spellchecker has about 150000 words. We could expand the dictionary, but that adds little value, since words in Malayalam, Tamil and similar languages can be formed by joining multiple words. In addition, words are inflected according to grammatical form (sandhi), number, gender and so on. Hunspell has a system to handle this, but so far nobody has succeeded in getting it to work for the multi-level suffix stripping that Malayalam requires. Sometimes a Malayalam word is formed by joining more than 5 words. We will need word-splitting logic or a table that covers all the patterns. The project is to attempt to solve this with Hunspell and, if that is not feasible (Hunspell upstream is not active), to develop an algorithm and implement it.

Recently there was an attempt to develop a Tamil spellchecker using Hunspell with multi-level suffix stripping. You can see the result at https://github.com/thamizha/solthiruthi. Our first attempt should be to use Hunspell to achieve spellchecking with agglutination and inflection. This will probably require a lot of scripting to generate suffix patterns, and we can ask existing language communities for help too. If Hunspell turns out to be limited with multi-level suffixes (Indian languages sometimes require more than 5 levels of suffix stripping), we need to document it (bug report and documentation) and attempt a Python-based solution on top of the libindic framework.
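
To show what multi-level suffix stripping means in practice, here is a minimal recursive sketch in Python. It is not the existing libindic spellchecker and not how Hunspell works internally; the function name, the toy English roots and the suffix list are all placeholders, and it ignores the letter changes that sandhi introduces at join points, which a real solution would have to model.

```python
def strip_suffixes(word, roots, suffixes, max_depth=5):
    """Return the suffixes removed to reach a known root (innermost first).

    Returns an empty list if the word is itself a root, and None if no
    chain of at most max_depth suffixes leads to a known root.
    """
    if word in roots:
        return []
    if max_depth == 0:
        return None
    for suffix in suffixes:
        # Only strip a suffix if something is left to act as the stem
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            chain = strip_suffixes(stem, roots, suffixes, max_depth - 1)
            if chain is not None:
                return chain + [suffix]
    return None


def is_known_word(word, roots, suffixes, max_depth=5):
    """A word is accepted if it reduces to a dictionary root."""
    return strip_suffixes(word, roots, suffixes, max_depth) is not None


if __name__ == "__main__":
    toy_roots = {"walk", "talk"}
    toy_suffixes = ["s", "er", "ing"]
    print(strip_suffixes("walkers", toy_roots, toy_suffixes))
    print(is_known_word("walkxs", toy_roots, toy_suffixes))
```

The `max_depth` parameter corresponds to the number of suffix levels allowed, so the default of 5 mirrors the depth that the text says Indian languages can require.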

The project is not about coding an existing algorithm, but to develop and implement an algorithm.

Hunspell's limitations can be understood from this conversation we had with the author of Hunspell in 2008

Homework to do before submitting applications:

  1. Use Hunspell in any Indian language like Malayalam for spell correction in editors or word processors and understand the limitations
  2. Study the nature of inflection and agglutination in Indian languages, read existing documents on this (ask for documents too) and note down your observations
  3. Study Hunspell and other spellcheckers to see how this problem is addressed
  4. Understand how a spell checker works. How to write a spellchecker from scratch?
  5. Come up with a plan about addressing the issue.

Complexity: Advanced

Confirmed Mentor:

How to contact the mentor:

Expertise required: Average understanding of the grammar system of at least one Indian language, and completion of the homework listed above.

What the student will learn:

Indic rendering support in ConTeXt

Project:

ConTeXt is another TeX macro system, similar to LaTeX but much more suitable for design. For more information about ConTeXt, see the wiki at http://wiki.contextgarden.net/Main_Page. ConTeXt MKII has Indic language rendering support using XeTeX, but MKII is deprecated and the new MKIV backend doesn't support Indic rendering yet. The aim of this project is to add Indic rendering support to ConTeXt MKIV. XeTeX uses HarfBuzz to do correct Indic rendering.

More Details: A partially working patch by Rajeesh for the MKIV Lua code is available. ConTeXt MKII (deprecated) can work with the XeTeX backend for Indic rendering. Here is a sample file:

\usemodule[simplefonts]
\definefontfeature[malayalam][script=mlym]
\setmainfont[Rachana][features=malayalam]
\starttext
മലയാളം \TeX ഉപയോഗിച്ച് ടൈപ്പ്സെറ്റ് ചെയ്തത്
\stoptext

Generate the output using the command:

texexec --xetex <file.tex>

Complexity : Advanced

Confirmed Mentor :

How to contact the mentor:

Expertise required: Understanding of the TeX system, experience in either LaTeX or ConTeXt, and a basic understanding of Indic language rendering. MKIV uses Lua, so familiarity with Lua, the OpenType specification or HarfBuzz will be an added advantage.

What the students will learn:

libindic Project Based

libindic Project Improvements

Project:

This is a set of ideas for improving the existing libindic infrastructure. We have decided on the following tasks as part of this project:

  1. Provide a REST API for libindic without disturbing the existing JSON-RPC API
  2. Improve the Transliteration module
  3. Integrate the Flask Webfonts extension with libindic to provide webfonts support

Improve Transliteration module

We have a Transliteration module that supports transliteration from any Indic language to any other Indic language, as well as English to Indic and Indic to English transliteration. It also supports the IPA and ISO 15919 transliteration systems. But the module isn't in perfect shape and has a lot of bugs. With this idea we would like to improve the following parts:

  1. Improve the cross-Indic transliteration system. Currently only Malayalam and Kannada work without going through another language; all other Indian languages are first transliterated to Malayalam and then to the target Indic language. We want to remove this source -> Malayalam -> target cycle.
  2. English to IPA transliteration is currently broken and needs to be fixed. See IPA transliteration bug.
  3. Once the IPA transliteration issue above is fixed, improve the English to Indic transliteration system using IPA. Currently English to Indic transliteration uses the CMU Sphinx dictionary, which has a limited set of words, which in turn limits the output of English to Indic transliteration.
  4. Improve the ISO 15919 to Indic transliteration system; see ISO 15919 to Indic transliteration.

CLDR has transliteration data for Indic languages. We can explore it and assess its feasibility. For an intermediate representation of the scripts, either IPA or the ISO 15919 standard can be used. All of these must be supplemented with exception rules and special-case handling to achieve better results.
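
One way to picture direct source -> target transliteration without a Malayalam pivot is the sketch below. It relies on the fact that the major Indic script blocks in Unicode share a parallel layout, so a character can be moved between scripts by a code point offset. This is a naive illustration, not the approach libindic uses; the function and table names are hypothetical, and it does nothing about letters that are unassigned in the target block, which is where the exception rules mentioned above would come in.

```python
# Start of each script's Unicode block; the blocks share a parallel layout.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points


def transliterate(text, source, target):
    """Shift every character of the source script into the target script's block.

    Characters outside the source block (spaces, digits, Latin text) pass
    through unchanged. No check is made that the shifted code point is
    actually assigned in the target script.
    """
    source_base = SCRIPT_BASE[source]
    target_base = SCRIPT_BASE[target]
    result = []
    for char in text:
        code_point = ord(char)
        if source_base <= code_point < source_base + BLOCK_SIZE:
            result.append(chr(code_point - source_base + target_base))
        else:
            result.append(char)
    return "".join(result)


if __name__ == "__main__":
    # Malayalam "കമല" shifted into the Devanagari block
    print(transliterate("\u0d15\u0d2e\u0d32", "malayalam", "devanagari"))
```

A table of per-pair exceptions layered on top of this offset mapping would be one possible design; whether that or an ISO 15919 intermediate form is the better fit is left open by the text.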

Converting indic processing modules currently in libindic into javascript modules library

Project:

Port some of the libindic algorithms to Node modules. Several modules and algorithms in the libindic project are currently written in Python, but porting them to JavaScript would help developers. For example, cross-language transliteration can be done in JavaScript too if we port the algorithm and transliteration rules. Similarly, the approximate search can be ported. Flexible fuzzy search on web pages will be possible if we have the algorithm in JavaScript.

The proposed JavaScript module pattern is https://github.com/umdjs/umd

Student proposals should include a list of the algorithms to be ported, planned demo applications, planned documentation details, and publishing details (example: npm registry).
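
As a rough picture of the kind of algorithm that could be ported, here is a small fuzzy word search based on Levenshtein edit distance, written in Python like the current libindic modules. It is a generic stand-in chosen for illustration and is not the algorithm used by the libindic approximate search module; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def fuzzy_search(query, text, max_distance=1):
    """Return the words of text within max_distance edits of query."""
    return [word for word in text.split() if edit_distance(query, word) <= max_distance]


if __name__ == "__main__":
    print(fuzzy_search("colour", "the color of colours"))
```

The logic uses only lists and string comparison, so a port to a UMD-style JavaScript module would mostly be a matter of syntax.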

Complexity :

Confirmed Mentor :

How to contact the mentor:

Mailing List: silpa-discuss@nongnu.org

Expertise required: JavaScript, Python

What the students will learn:

Integrate Varnam into libindic

Create a libindic module which hosts Varnam. This includes making a Python port of libvarnam and a libindic module which uses the Python port.

Complexity : Medium

Confirmed Mentor :

How to contact the mentor:

Mailing List: silpa-discuss@nongnu.org

Expertise required: C, Python

Grandham

Adding MARC21 import/export feature in Grandham

We need a feature in Grandham to import and parse data from MARC21 documents. We should also be able to export existing data in MARC21.
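
For orientation, the sketch below shows the binary record layout that an import/export feature has to handle: a 24-character leader, a directory of 12-character entries (tag, field length, start offset), then the fields, using the standard field, record and subfield separator bytes. It is written in Python only to illustrate the layout and is not Grandham code; the function names and the sample record are made up, and the leader values other than the two lengths are fixed placeholders.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
SUBFIELD_DELIMITER = "\x1f"
LEADER_LENGTH = 24


def export_record(fields):
    """Serialize (tag, text) pairs into one MARC21 record as bytes."""
    directory = b""
    body = b""
    for tag, text in fields:
        field = text.encode("utf-8") + FIELD_TERMINATOR
        # Directory entry: 3-char tag, 4-digit byte length, 5-digit start offset
        directory += f"{tag}{len(field):04d}{len(body):05d}".encode("ascii")
        body += field
    base_address = LEADER_LENGTH + len(directory) + 1  # +1 for directory terminator
    record_length = base_address + len(body) + 1       # +1 for record terminator
    leader = f"{record_length:05d}nam a22{base_address:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_TERMINATOR + body + RECORD_TERMINATOR


def import_record(record):
    """Parse one MARC21 record (bytes) back into (tag, text) pairs."""
    leader = record[:LEADER_LENGTH].decode("ascii")
    base_address = int(leader[12:17])  # leader positions 12-16
    directory = record[LEADER_LENGTH:base_address - 1]
    fields = []
    for offset in range(0, len(directory), 12):
        entry = directory[offset:offset + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        raw = record[base_address + start:base_address + start + length]
        fields.append((tag, raw.rstrip(FIELD_TERMINATOR).decode("utf-8")))
    return fields


def split_subfields(data):
    """Split a data field into its indicators and (code, value) subfields."""
    indicators, *parts = data.split(SUBFIELD_DELIMITER)
    return indicators, [(part[0], part[1:]) for part in parts if part]


if __name__ == "__main__":
    sample = [("001", "12345"), ("245", "10\x1faമലയാളം\x1fcSMC")]
    record = export_record(sample)
    for tag, text in import_record(record):
        print(tag, split_subfields(text) if tag >= "010" else text)
```

Lengths and offsets are counted in bytes rather than characters, which matters for Malayalam titles encoded as UTF-8; that is why the sketch works on bytes and decodes each field only after slicing.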


Complexity : High

Confirmed Mentor :

How to contact the mentor:

Expertise required: Knowledge of Ruby / Ruby on Rails

What the students will learn: