User:Saatvikshah1994/gsoc

Personal Information

Name: Saatvik Shah

Email : saatvikshah1994@gmail.com

Address: ---

IRC Nick: tangy

Bitbucket Username: saatvikshah1994

Telephone : ---

Current Education: Second-year undergraduate student pursuing Computer Engineering at Malaviya National Institute of Technology, Jaipur (NIT Jaipur)


Why do you want to work with the Swathanthra Malayalam Computing?

1. SMC provides some of the best open source modules for Indic language processing, and they are useful to a large community. The guidance and mentoring I would receive, together with a fresh perspective on developing software applications, will help me contribute meaningfully and enrich my knowledge and experience.

2. Working on Python and Android simultaneously will increase the workload, but it will provide excellent experience along with the opportunity to work on a large project.

3. It will improve my knowledge of Natural Language Processing while also improving my developer skills.


Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor?

No


Did you participate with the past GSoC programs, if so which years, which organizations?

No


Do you have other obligations between May and August? Please note that we expect the Summer of Code to be a full time, 40 hour a week commitment

No, I have no other obligations. Given the opportunity, I will dedicate my complete time between May and August to the project, as detailed in the timeline below.


Will you continue contributing/ supporting the Swathanthra Malayalam Computing after the GSoC 2014 program, if yes, which area(s), you are interested in?

I will be happy to continue contributing alongside my other projects. I would prefer to contribute to NLP-based modules, additional Android-based extensions and possibly Eclipse plugins.


Why should we choose you over other applicants?

1. Exposure to Android application development on multiple prestigious projects (highlighted below), along with strong problem-solving skills.

2. Familiarity with basic machine learning and evolutionary optimization algorithms for classification problems, as well as feature extraction and reduction with algorithms such as PCA and ICA.

3. Python-based projects involving regular expressions, web scraping and various networking modules.

4. Previous experience in porting the popular ClamAV to Android with an improved implementation model.


Proposal

Overview

Develop an Android SDK for SILPA with all of its modules ported to an Android library, so that application developers can use the modules on all Android platform versions, independent of Android fragmentation. Add secondary extensions such as popular fonts and server-side support for specific modules. The SDK should be user-friendly, bug-free and adequately documented.


Needs it Fulfills

1. Developers get easy access to a rich set of routines for adding Indic language processing features to their applications, free of bugs and working across platforms.

2. A well separated and classified set of libraries, one per module, with adequate documentation to enable easy and efficient usage.

3. Android requires that processing behind the UI be fast so that the UI stays responsive. Suitable data structures and code optimizations will be added to keep interaction with users fast and the UI responsive.


Experience

My programming experience is primarily in the following fields:

1. Android Application Development

2. Python for Automation : Web Scraping, Fast Networking

3. Matlab: For writing optimization algorithms, machine learning based evolutionary algorithms and image processing

4. Android Security and Testing : Use of tools such as drozer for testing defects and security leaks in user applications

I have highlighted my research and application development work in the “About Me” section elaborately.


Project Goals

1. Providing a stable back end with database and server support for Silpa Android SDK

2. Porting modules of Silpa from Python to Java and adding to the SDK

3. Providing developer-friendly access to the ported modules, with optimized code that avoids wasted computation and executes quickly

4. Rigorous testing and solving Fragmentation issues

5. Adequate Documentation

6. Additional : Extra features in the SDK which utilise mobile portability and sensor inputs

Proposal Implementation

1. Precoding Research

Studying the silpa modules and basic Android functionality which is to be added


2. Support for Native Libraries

Set up a background service that makes RPC calls to native C libraries such as libcairo and libpango, so that modules relying on these native libraries run fast.


3. Substituting External Libraries

External modules and dependencies used in Python must be effectively substituted in Java, by finding up-to-date and well-optimized Java libraries to replace the ones the Python code currently uses. Since Java has equivalents for almost all of the external modules that Python provides, this should not be too time consuming. For example, Apache Commons Codec provides a Soundex encoder for English that can serve as a reference for the Indic Soundex port, the Levenshtein distance in the Apache Commons StringUtils class can be used for ApproxSearch, and the JOrtho spell-checking library's databases can be updated with the required rules.
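To make the ApproxSearch idea concrete, here is a minimal dependency-free sketch of Levenshtein-based matching. The class and method names (EditDistanceSketch, approxMatch) are hypothetical and are not taken from SILPA or Apache Commons. A real port might simply call the library routine instead.

```java
/** Hypothetical helper for an ApproxSearch port; not taken from SILPA. */
class EditDistanceSketch {

    /** Classic dynamic-programming Levenshtein distance over UTF-16 chars. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True when the candidate is within maxEdits edits of the query. */
    static boolean approxMatch(String query, String candidate, int maxEdits) {
        return levenshtein(query, candidate) <= maxEdits;
    }
}
```

Keeping only two rows of the table keeps memory use proportional to the shorter dimension, which matters on low-memory devices.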


4. Establish a Work Flow

A work flow should be established at the start to ensure fast implementation and effective mentor review. The modules to port must be shortlisted after compatibility checks, with precise details on how each existing implementation should be ported to Java. Finally, the development methodology to follow (for example, whether the focus should be on constant test-driven development) should be chosen so that it best satisfies the mentors and suits the developer.


5. Module Porting

The modules should be effectively ported and added to the SDK one by one adding support for:

a. Resources/XML Specifications

Every module has some backend resources which must be supported by the SDK. Java's HashMap is the equivalent of Python's dict. Some of these resources include:

1.Popular fonts/typefaces for supported languages

2.CharDetails : Sqlite database or Hashmap to store details of characters

3.Transliteration: CMU Hashmap or Database Table

4.SoundEx : Hashmap or Database of Language Characters to Phonetic Codes

5.Fortune : SQLite Database or Hashtables of Quotes

6.Spell Check : Hashtable or SQLite Database of words and rules

7.Font Hashtables for Unicode to ASCII character conversion

8.Katapayadi Numbers : Hashtable of language bases of Sanskrit Numeral

9.Stemmer : Hashtable of Word->Stemmer Entries

10.UCA Sort : Hashtable of Sorting Rules for UCA

11.Payyans : Hashtable for ASCII to Unicode conversion rules

12.Hyphenate : Hashtable of Regular Expressions for Hyphen Pattern Matching

While most of these can be set up as different XML resources or added to a packaged SQLite database, some may have to be initialized directly in Java.


b. Java Libraries

Once the resources are available, the routines for every module should be systematically ported to Java along with additional dependencies:

1.Initialization with resource and database support

2.Addition of external libraries and setting up rules for Indic processing

3.AsyncTasks and Java's threading support to speed up CRUD operations on the database

4.Using content providers wherever possible such as for PDF conversion
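As an illustration of keeping lookups off the UI thread, the sketch below uses a plain-Java ExecutorService as a stand-in for Android's AsyncTask, and a ConcurrentHashMap as a stand-in for a database table. All names (BackgroundLookupSketch, lookupAsync) are hypothetical. On Android, the result would be posted back to the UI thread rather than read from the Future directly.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: runs resource lookups on a worker thread. */
class BackgroundLookupSketch {
    // Daemon worker so an idle executor never keeps the process alive.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "silpa-lookup");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for a SQLite table of key -> value details.
    private final Map<String, String> table = new ConcurrentHashMap<>();

    void put(String key, String value) {
        table.put(key, value);
    }

    /** Schedules the lookup on the worker thread and returns immediately. */
    Future<String> lookupAsync(final String key) {
        return worker.submit(new Callable<String>() {
            @Override
            public String call() {
                return table.get(key); // null when the key is absent
            }
        });
    }

    void shutdown() {
        worker.shutdown();
    }
}
```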


c. Precise Java Implementations for most modules have been analyzed and discussed

1. Indicngram

An n-gram model is an NLP technique that predicts the next item in a sequence from the items that precede it. It relies on the same Markov assumption that governs state transitions in a Hidden Markov Model, where the next state depends only on a fixed number of previous states. (I have studied and gone through a basic implementation of HMMs for EEG classification in Matlab.) Characters or words are treated as tokens to predict the next possible character or word.
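A minimal sketch of the n-gram extraction step follows. The names (NgramSketch, charNgrams, wordNgrams) are hypothetical. It splits text by Unicode code point, which is an assumption made for brevity: a real port for Indic scripts would probably need syllable-aware segmentation so that vowel signs stay attached to their consonants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of the existing SILPA code base. */
class NgramSketch {

    /** Returns all contiguous character n-grams, counted in code points. */
    static List<String> charNgrams(String text, int n) {
        int[] codePoints = text.codePoints().toArray();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            grams.add(new String(codePoints, i, n));
        }
        return grams;
    }

    /** Returns all contiguous word n-grams, splitting on whitespace. */
    static List<String> wordNgrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }
}
```

Counting how often each n-gram occurs in a corpus then gives the frequencies needed to rank candidate next characters or words.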


2. Guess Language :

The GuessLang C++ implementation can be ported in two ways :

a. Understand and port the basic GuessLang Algorithm using hashtable of rules

b. Google recently released the lang-guess Java library, which uses a Bayesian classifier to achieve high classification accuracy and could be used instead


3. Fortune :

Given a search word or pattern as input, the module searches its database for quotes matching the pattern and returns a random quote from the list of matches.
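The lookup described above could be sketched as follows, assuming the quotes are already loaded into memory and that matching is a plain substring test. Both are simplifying assumptions, and the class name FortuneSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical in-memory stand-in for the Fortune quote database. */
class FortuneSketch {
    private final List<String> quotes;
    private final Random random;

    FortuneSketch(List<String> quotes, Random random) {
        this.quotes = new ArrayList<>(quotes);
        this.random = random;
    }

    /** Returns every quote that contains the keyword. */
    List<String> matching(String keyword) {
        List<String> hits = new ArrayList<>();
        for (String quote : quotes) {
            if (quote.contains(keyword)) {
                hits.add(quote);
            }
        }
        return hits;
    }

    /** Returns one matching quote chosen at random, or null if none match. */
    String randomQuote(String keyword) {
        List<String> hits = matching(keyword);
        if (hits.isEmpty()) {
            return null;
        }
        return hits.get(random.nextInt(hits.size()));
    }
}
```

Passing the Random in from outside makes the class easy to test with a fixed seed.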


4. Katapayadi Numbers :

Corresponding Java : This algorithm can be replicated in Java without external dependencies, using a hashtable and character parsing. First, check which language's number system the word belongs to, from a set of language bases. Then apply the conversion algorithm from Katapayadi number to mathematical number (as used in the Python module) by taking each character and looking it up in the char -> number hashtable for that language base.
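A rough sketch of the conversion for a single language base (Devanagari) is given below. The digit table follows the traditional Katapayadi assignment and is not copied from the Python module. The sketch assumes that a consonant followed by a virama carries no value, that independent vowels count as zero, and that digits are read right to left. The class name KatapayadiSketch is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical Devanagari-only Katapayadi decoder. */
class KatapayadiSketch {
    private static final char VIRAMA = '\u094D';
    private static final Map<Character, Integer> DIGITS = new HashMap<>();

    /** Assigns consecutive digits (wrapping 10 to 0) to a run of letters. */
    private static void assign(String letters, int firstDigit) {
        for (int i = 0; i < letters.length(); i++) {
            DIGITS.put(letters.charAt(i), (firstDigit + i) % 10);
        }
    }

    static {
        assign("\u0915\u0916\u0917\u0918\u0919", 1); // ka kha ga gha nga: 1..5
        assign("\u091A\u091B\u091C\u091D\u091E", 6); // ca cha ja jha nya: 6..9, 0
        assign("\u091F\u0920\u0921\u0922\u0923", 1); // retroflex row: 1..5
        assign("\u0924\u0925\u0926\u0927\u0928", 6); // ta tha da dha na: 6..9, 0
        assign("\u092A\u092B\u092C\u092D\u092E", 1); // pa pha ba bha ma: 1..5
        assign("\u092F\u0930\u0932\u0935\u0936\u0937\u0938\u0939", 1); // ya..ha: 1..8
    }

    /** Converts a Devanagari word to its digit string, read right to left. */
    static String toNumber(String word) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            boolean beforeVirama = i + 1 < word.length() && word.charAt(i + 1) == VIRAMA;
            if (DIGITS.containsKey(c) && !beforeVirama) {
                digits.append(DIGITS.get(c)); // only vowel-bearing consonants count
            } else if (c >= '\u0905' && c <= '\u0914') {
                digits.append(0); // independent vowel
            }
        }
        return digits.reverse().toString();
    }
}
```

The result is returned as a string so that leading zeros survive. Supporting other scripts would mean adding one table per language base, as described above.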


5. Text Similarity

Uses the indicngram (and charngram) and normalizer NLP modules in combination with cosine similarity. The n-gram profiles of the two texts are treated as vectors, and the result is their normalized dot product, which lies between 0 and 1. The Java implementation of n-grams is discussed above.
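A minimal sketch of cosine similarity over character bigram counts is shown here. The use of bigrams, the lack of any normalization step and the class name TextSimilaritySketch are all assumptions made for illustration. The Python module may tokenize differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cosine-similarity helper over character bigrams. */
class TextSimilaritySketch {

    /** Counts each overlapping two-character substring of the text. */
    static Map<String, Integer> bigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            String gram = text.substring(i, i + 2);
            Integer old = counts.get(gram);
            counts.put(gram, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Dot product of the two count vectors divided by their lengths. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            int value = entry.getValue();
            normA += value * value;
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += value * other;
            }
        }
        for (int value : b.values()) {
            normB += value * value;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // at least one text has no bigrams
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static double similarity(String text1, String text2) {
        return cosine(bigramCounts(text1), bigramCounts(text2));
    }
}
```

Because all counts are non-negative, the result always falls between 0 (no shared bigrams) and 1 (identical profiles).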


6. Stemmer :

Replace the rules and word->stemmer dictionaries with corresponding hashmaps in Java. Implement the same algorithm as in Python: trim the word, remove punctuation, then loop over progressively shorter suffixes and test each against the rules to find the stemmed form of the word.
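The suffix loop could look roughly like this. The rule format (suffix -> replacement) and the class name SuffixStemmerSketch are assumptions. The actual SILPA rule files would be loaded into the map instead of the toy English rules used in the usage note.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical rule-driven suffix stemmer. */
class SuffixStemmerSketch {
    private final Map<String, String> rules; // suffix -> replacement

    SuffixStemmerSketch(Map<String, String> rules) {
        this.rules = new HashMap<>(rules);
    }

    /** Strips surrounding punctuation, then applies the longest matching rule. */
    String stem(String rawWord) {
        String word = rawWord.trim().replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
        // Start from the longest suffix and shrink it, keeping at least one
        // leading character of the word untouched.
        for (int start = 1; start < word.length(); start++) {
            String replacement = rules.get(word.substring(start));
            if (replacement != null) {
                return word.substring(0, start) + replacement;
            }
        }
        return word; // no rule matched
    }
}
```

With the toy rules "ing" -> "" and "ies" -> "y", the word "ponies" would be reduced to "pony".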


7. Shingling :

Will require deeper understanding of the Python API for shingling


8. UCA Sort :

a.Understand how the UCA sort algorithm works.

b.Equivalent of pyuca.Collator in Java is java.text.Collator

c.Suitably implement UCA in Java with help of java.text library and modify sorting rules in java.text.Collator according to language specs
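As a starting point for the steps above, the sketch below sorts with a locale Collator and with a RuleBasedCollator built from custom rules, both from java.text. The class name CollationSketch and the sample rule string in the usage note are illustrative only. Actual rules would have to be written per language.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical collation helper built on java.text. */
class CollationSketch {

    /** Sorts a copy of the words using the default collator for the locale. */
    static List<String> sortForLocale(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted, collator);
        return sorted;
    }

    /** Sorts a copy of the words using custom RuleBasedCollator rules. */
    static List<String> sortWithRules(List<String> words, String rules) {
        RuleBasedCollator collator;
        try {
            collator = new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted, collator);
        return sorted;
    }
}
```

For example, the rule string "< c < b < a" makes the collator order "c" before "b" before "a". Language-specific orderings can be expressed the same way.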


9. Payyans :

Set up the rules hashmap and follow the Python algorithm for conversion: initial normalization followed by rule-based conversion.


10. Hyphenate :

Guess the language, then fetch the hyphen pattern matching regular expressions for the guessed language from the hashtable, and finally detect patterns and apply hyphens at the appropriate positions.


11. Chardetails

Look up the input character's details in the database or hashtable.
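Some character details are already available from the Java standard library, which could complement the packaged table. The sketch below uses Character.getName and Character.UnicodeBlock.of on a desktop JVM. Whether these calls are available on the targeted Android versions is an assumption that would need checking. The class name CharDetailsSketch is hypothetical.

```java
/** Hypothetical character-details helper backed by java.lang.Character. */
class CharDetailsSketch {

    /** Returns the Unicode name of the code point, or null if unassigned. */
    static String name(int codePoint) {
        return Character.getName(codePoint);
    }

    /** Returns the Unicode block that contains the code point. */
    static Character.UnicodeBlock block(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    /** Formats code point, name and block on one line. */
    static String describe(int codePoint) {
        return String.format("U+%04X %s [%s]", codePoint, name(codePoint), block(codePoint));
    }
}
```

Details that the standard library does not provide would still come from the CharDetails database or hashtable.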


12. ScriptRender

Replace the Python wiki parser with java-wikipedia-parser or JSoup. Call the libcairo and libpango libraries natively using the Android NDK.

Further compatibility testing is required for HarfBuzz client-side rendering.


13. Spell Checking

Use the JOrtho spell-checking library, or a hashtable of spell-checking rules for parsing string inputs, to carry out spell checking.


c. Developer Access


i. Why use the Silpa SDK?

Silpa provides some of the best Indic Processing modules as explained above. Developers can use the Silpa SDK to add indic processing functionality into their application. While there are a number of NLP based libraries available for Android, there is no well supported SDK or libraries for Indic Processing based applications. Some practical and feasible applications are:

a. Indian Tourism based Applications

b. Movie, food and other review-based Android applications such as Zomato, to allow people to read reviews and add their own in their mother tongue

c. An Indian WhatsApp where two people with different mother tongues can understand each other while each uses their own language, and can share their favourite quotes via Fortune

d. Script render can be added to common Android browsers to save wikis, and eventually web pages, in PDF format for fast future reading, sharing via Bluetooth and storing for later use, since saving web pages is not very convenient in a normally cluttered file system.

e. Calculator supporting Katapayadi Numbers

f. Android Indic text editor supporting searching, word processing, keyword listing, etc., as common text editors provide.

g. Indic based Notes Application


ii. Importing the Silpa SDK in Application Project?

The Silpa SDK will essentially consist of a set of JAR files along with resources and any required dependencies, as discussed above. It can simply be imported into the application through Project Properties, which supports adding libraries and SDKs. Once this is done, the Android manifests of the developer's application and of Silpa can be merged by setting manifestmerger.enabled to true in project.properties, which imports the required permissions, services and activities.


iii. Interface and Developer Interaction provided by the Silpa SDK?


Programming Interface

The Silpa SDK will have a single SilpaSDK parent class, with child classes for the individual modules that the developer instantiates. A simple example of how this will work:

First, initialize the module class with the application context in whichever activity it is to be used:

IndicTransliterator myTransliterator = new IndicTransliterator(context);

Using different modules

The SilpaSDK parent class will have child classes, each linked with a ported module, its resources and handlers, and optionally with customizable views. For example, SilpaSDK is the parent class of IndicTransliterator, ScriptRender, IndicCharDetails, etc. Each module's child class will also expose a set of public constants for keywords such as language names and script types, so that the developer can pick them in IDEs like Eclipse without having to key in precise inputs. This structured, hierarchical approach follows good programming practice and fits the OOP style of Java.

Example Implementations for a few modules


a. Transliteration

IndicTransliterator myTransliterator = new IndicTransliterator(context);

String transliteratedText = myTransliterator.transliterate(IndicTransliterator.ENGLISH, IndicTransliterator.HINDI, text);

where IndicTransliterator.<LANG> will be predefined public constants so that the developer need not bother with providing precise keywords.


Developers generally want flexibility in SDK functionality. If a developer wants to add a transliteration option between English and Hindi, he can name the two languages as shown in the method above and pass in the corresponding text (obtained from an EditText). The SDK function returns the transliterated form of the text, which he can then display in a TextView.


b. Fortune

IndicFortune myQuotes = new IndicFortune(context, quotesResource);

where quotesResource represents a predefined XML hashmap resource holding a set of quotes in Malayalam, Tamil, etc.

String singleRandomQuote = myQuotes.getRandomQuote(keyword, pattern);

String[] allQuotes = myQuotes.getAllQuotes(keyword);


c. TextSimilarity

IndicTS myTexts = new IndicTS(context);

double similarityFraction = myTexts.computeSimilarStrings(text1, text2);

double fileSimilarityFraction = myTexts.computeSimilarFiles(f1, f2, extension);

which will open and parse the files and then compute their similarity.

Other modules can similarly be implemented.


Resource Interface

XML Resources such as hashmaps for table rules, different fonts and syllables for every language, custom views, images,etc will also be added.

Fonts and sounds will be stored as raw resources which can then be loaded by getResources().openRawResource(resource);

Images can be stored as a drawable resource and be loaded as an ImageView or Bitmap by

Bitmap bmp = BitmapFactory.decodeResource(context.getResources(), R.drawable.my_indic_image);

Hashmaps for use in internal methods of module classes can be similarly loaded from XML resources which will all be previously packaged. Views can be used and customized directly by inflating and then adding components to the respective views.

View myCustomView = LayoutInflater.from(context).inflate(R.layout.customview, null);


• User Interface

In the case of User Interface complete flexibility on how to use UI components will be available to the Developer.


A. Predefined Activities : Some developers will want to use module user interfaces directly, preloaded with the necessary EditText, TextView and Button widgets to handle user input and output, and receive only the final result. For them, a set of activities with multiple XML templates and the necessary animations will be added, with their corresponding Java Activities defined. Using startActivityForResult(), these developers can start a prebuilt Activity directly and receive the processed output as the result of that Activity. Recent Android UI features such as Fragments, the Action Bar, sliding menus and widget support can be added later.


B. Custom Views : A set of custom views can be provided for developers who prefer to add their own design customizations. Such views can be defined in XML layouts, inflated by the developers themselves and then customized according to their preferences.


d. Adding Views

Optionally, sample views can be provided (EditText for input, custom typefaces, progress bars, etc.) to make the developer's work easier, while leaving the developer free to customize the provided views or create new ones.


e. Test Driven Development

Modules will be added, after which JUnit-driven testing, monkey testing, stress testing and security testing (to prevent memory leaks or database holes) will be conducted at periodic intervals. Cross-platform compatibility issues will also be resolved.


f. Final Software and Hardware Testing

The final packages of the SDK will be tested on actual hardware wherever possible and extensively on virtual machines. Fragmentation issues and hardware dependencies can be detected and solved here.

Upload the SDK to online repositories and provide a plugin for Eclipse/Android Studio so that developers can directly download and start using the SDK.


6. Additional Features

SDK can support additional features not mentioned in the Ideas Page such as

a. Location Based Transliteration : GPS integration through which the application can detect the present location and set user preferences accordingly. For example, if a person goes to Gujarat, the application can automatically switch the default options for modules such as transliteration to Gujarati.

b. Android IME Extension : Improvements to Silpa's Android IME extension in terms of computation speed and improved RPC calling via the same background service.

These are just a few among many possible additions which Android can support thanks to its portability, GPS, Bluetooth and additional hardware modules.


7. Documentation

Adequate Documentation with JavaDoc covering complete implementation

Timeline

INITIALIZATION & RESEARCH (April 8th - May 18th)

April 8th - April 12th : Learning how to call external C libraries (like libcairo and libpango), compiling HarfBuzz and adding additional functionality in Android via the NDK.

April 12th - April 27th : Familiarize myself with modules used in the silpa project with help of mentors

(April 28th - May 7th: College End Term Exams)

May 8th : Discuss and shortlist with mentors which modules are to be ported to the Java-Android platform, and how.

May 9th : Understand creation of Android libraries.

May 10th - May 12th : Go through the Python Modules, check for externally used libraries and find replacement Java libraries

May 12th - May 16th : Develop a rough plan about how each module will be replicated in Java and added to the SDK. Possibly incompatible features will be tested and solutions proposed and discussed.

May 17th - May 18th : Final discussions about starting code implementation and finalizing the ports planned according to mentor suggestions. Split all modules into two halves according to porting difficulty.


CODING PERIOD(May 19th - August 1st)

May 19th - June 1st : Adding first half of the modules into Android/Java

June 1st - June 3rd (Code Review) : Discussing optimization of classes, methods and data structures used with mentors

June 3rd - June 7th : Add the suggested improvements and changes

June 7th - June 14th : Aggregating the created Java libraries into the Silpa SDK

June 14th - June 17th(Bug Fixing Period) : Monkey Testing(and additional Stress Testing), Testing for Security Issues, Solving Cross Platform Issues

June 18th - June 20th : Adding server side support. Continue removing bugs and solving cross-compatibility issues.

June 21st - June 22nd : Midterm Evaluation of Work Progress and Implementation Completed, Corrections and Changes suggested by Mentors

June 23rd - June 25th : Adding suggested changes and code improvements

June 26th - July 8th : Adding second half of the modules into Android/Java. Adding server side support for required modules.

July 9th - July 10th(Code Review) : Discussing optimization of classes, methods and data structures used with mentors

July 11th - July 20th (Bug Fixing Period) : Adding suggestions, aggregating into the SDK, stress testing newly added modules. Possible use of features provided by Android that have no counterpart in the Python code, such as AsyncTask and Java's threading support.

July 20th - July 31st: Final suggestions and changes discussed and implemented


DOCUMENTATION AND CONCLUSION (August 1st - August 22nd)

August 1st - August 15th : Backup Time/Final Testing/Documentation Begins

August 16th - August 22nd : Documentation Work Completion


POST GSOC

Actively improve and contribute by providing Android extensions, posting any newly added modules, and adding additional SDK tools, NLP-based improvements and Eclipse extensions.



About Me

Projects Undertaken and Completed

• Indian Railways Maintenance Application : Received a certificate, with a patent pending, for an Android application and website for detection and maintenance of railway station gears and equipment via GPS integration, centralized on a database and a web server. The application was developed for and is currently being used by North East Frontier Railway.

• Android Security : Developed a demo botnet Android application as a project under the Computer Engineering Department. It runs as a background service to log user calls, contacts and location, control SMS sending and provide a number of other features, connected to a web server using minimal network traffic. It was not detected by multiple Android antivirus products such as Avast Antivirus.

• CloudClam : A port of ClamAV (a popular open source antivirus) as an Android application which follows the ClamAV model to detect Android malware. It additionally aims for faster detection by combining a smaller offline database with a larger online database and better string-matching algorithms. A demo prototype has been developed, and work is ongoing on better algorithms and a structure that optimizes speed and efficiency.

• Python Web Scraping Tool : Developed a website scraper which extracts regulatory information for stock brokers and banks from their respective websites. The firm uses the tool for its search engine, which saves it the cost of one person.

• All in One Communicator : An all-in-one Android application giving the user a single place to maintain contacts, send messages and send emails.

• Submitted a paper to Memetic Computing on a group based swarm evolution algorithm (GSEA) for classification of mental tasks based on neural networks

• B.C.I Controlled Wheelchair (a TEQIP supported project) : Brain Computer Interface controlled wheelchair using a machine learning based neural classifier (only final GUI development is remaining)


Awards/accolades

• Tryst, IIT Delhi : Won the third prize at Connecting the Dots, an image processing competition

• Stood 3rd(1st in Android Application Category) in Android Application/Web Application Development competition organized by US startup SwiftDay for colleges in Jaipur.


Additional Links

1. Certificate received from Indian Railways[1]

2. Research Paper on GSEA (in communication with Memetic Computing)[2]


Contact with Mentors

Contacted Jishnu Mohan and Hrishikesh K.B, who clarified the project goals and what would be expected.