User:Nandaja/GSoC 2013 Automated Rendering Testing
Personal information
- Name: Nandaja Varma
- Email Address: <nandaja.varma AT gmail DOT com>
- Freenode IRC Nick: gem
- University and current education: BTech Computer Science, Calicut University (NSS College, Palakkad)
- Blog URL: nandajavarma.wordpress.com
Why do you want to work with Swathanthra Malayalam Computing?
Ever since I came to know about the activities of SMC, around the start of my second year at college, I have wanted to be a part of this community and make significant contributions to it. I see GSoC as a great opportunity to do so, and I would like to contribute through any other means possible as well.
Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor?
Yes. I recently started contributing to SMC's GNOME localization team. I contribute to the Debian community as a packager, mainly packaging Ruby gems for Debian. I also recently became involved in digitization work with Malayalam Wikigrandhashala.
Have you participated in past GSoC programs? If so, in which years and with which organizations?
No, I did not.
Do you have other obligations between May and August? Please note that we expect the Summer of Code to be a full-time, 40-hours-a-week commitment.
I have no other obligations whatsoever during these months. I will be able to make the 40-hours-a-week commitment to GSoC.
Will you continue contributing to and supporting Swathanthra Malayalam Computing after the GSoC 2013 program? If yes, which areas are you interested in?
Yes, most definitely. I would like to continue contributing to the localization work, as translation is one of my areas of interest. I would also like to make major contributions to SMC's work on fixing rendering issues.
Why should we choose you over other applicants?
I understand how the rendering engine Harfbuzz works, and I have experimented with a couple of scripts that print the glyph index of a given text in a given font. For implementing my project idea, I have good knowledge of the C programming language and good reading and writing skills in Malayalam, which will help me in creating the list of baseline glyph words for this project. I also have a fairly clear understanding of the text rendering stack and its constituent modules.
Project Description
An Overview of your proposal
Harfbuzz is an open-source library for shaping Unicode text, especially complex scripts. The main objective of this project is to develop an automated mechanism to test what Harfbuzz renders for different Indic languages. As of now, there is no automated mechanism to check whether Harfbuzz is rendering text correctly. Because Harfbuzz is efficient, widely used, and likely to remain in use for a long time to come, the project is highly relevant. The proposed system will be able to test renderings in different Indic languages using different fonts.
The need you think it fulfills
Implementing the above idea can ensure that what is rendered by Harfbuzz is actually correct. Such a mechanism would make things easier for developers and users, because at present the only way to check a rendering is to test it manually, which is time-consuming and error-prone. Also, anyone can get renderings tested, whether or not they know the language being rendered.
Any relevant experience you have
I have decent knowledge of the C programming language. I am quite familiar with the Harfbuzz architecture and its renderings. My knowledge of the text rendering stack, glyphs and Unicode encoding will also help me take the project further. In addition, I have experience in localization and digitization work which, I hope, will help me at some points of the project.
How you intend to implement your proposal
Harfbuzz is a shaping engine for Unicode text, especially complex scripts. It offers two utilities, hb-view and hb-shape, for viewing and testing the rendering. hb-view outputs a view of the rendered text in a given font, as an image, whereas hb-shape outputs the glyph index of that text in that font. For example, if we give the command: hb-shape Rachana.ttf മലയാളം , we get an output like this: [m1=0+1046|l3=1+1462|y1=2+1624|uni0D3E=2+826|lh=4+1134|uni0D02=4+856], which is the glyph index of the word 'മലയാളം'. Each entry in this output names a glyph, followed by its cluster value and its advance. Glyphs represent the shapes that characters can take when they are rendered or displayed. Opentype is the prominent font standard in use today. Opentype font technology deals with glyphs, whereas Unicode deals with characters. Glyph indices identify the glyph(s) in a font to which a Unicode character or character sequence is mapped. Glyph indices are therefore among the most important things to deal with when it comes to rendering.
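As an illustration of the output format described above, here is a minimal Python sketch that splits one line of hb-shape output into glyph name, cluster and advance. Python is used only for brevity; the function name `parse_hb_shape` is hypothetical, and the pattern assumes the bracketed `name=cluster+advance` form shown in the example, with an optional `@x,y` offset.

```python
import re

# One item of hb-shape output: glyph name, '=', cluster,
# an optional '@x,y' offset, then '+' and the advance.
# Assumption: this covers the form shown in the text; other
# hb-shape output options are not handled here.
_ITEM = re.compile(
    r"^(?P<name>[^=@+|]+)=(?P<cluster>\d+)"
    r"(?:@-?\d+,-?\d+)?\+(?P<advance>-?\d+)$"
)


def parse_hb_shape(output):
    """Parse '[name=cluster+advance|...]' into (name, cluster, advance) tuples."""
    body = output.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected output wrapped in [ ]: %r" % output)
    inner = body[1:-1]
    if not inner:
        return []  # no glyphs at all
    glyphs = []
    for item in inner.split("|"):
        match = _ITEM.match(item)
        if match is None:
            raise ValueError("unrecognised item: %r" % item)
        glyphs.append(
            (match["name"], int(match["cluster"]), int(match["advance"]))
        )
    return glyphs


if __name__ == "__main__":
    line = ("[m1=0+1046|l3=1+1462|y1=2+1624|uni0D3E=2+826"
            "|lh=4+1134|uni0D02=4+856]")
    for name, cluster, advance in parse_hb_shape(line):
        print(name, cluster, advance)
```

For a correctness check only the glyph names and their order matter, so the cluster and advance values could be dropped after parsing.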
To automate the testing, we will evaluate the output of hb-shape. Since it shows the glyph index of any word given as input, we can check this value for correctness. The methodology is as follows. Create a baseline glyph words list that consists of each word and its corresponding glyph index for each font. This list must contain the correct rendering of each of the words specified, and it has to be created for every Indic language for which we plan to implement this testing. To create it, we can make use of FontForge, a font editor, which shows the layout of each character in a font. The glyph index data fetched from FontForge for different Indic languages can then be used to build the baseline glyph words table.
We obviously cannot create a table with every single character or character combination possible: that would be difficult, and it would also be inefficient, as it would drastically slow down the comparison. Special care should therefore be taken to create a table that consists of the most important characters that might go wrong and should not, special-case characters, etc. We have to pick the words or character combinations intelligently, so as to significantly decrease the total number of entries in the list. The list itself can be provided as a separate text file and loaded into a database, or into a more efficient data structure such as a hash table or a trie, so that entries can be searched quickly.
Then a script should be written, in C, that accepts hb-shape output as input, checks it against our baseline glyph words list, finds the exactly matching word and sees whether the glyph indices match. If they do not, the word can be flagged as incorrectly rendered. It may also happen that the word being compared does not appear in the list we provide. This is where the quality of our word selection matters. We can assume either that the particular word or character is very rarely used, or that the input word was given wrongly. If the same word is looked up more than a certain number of times, we can conclude that our assumption was wrong, flag this word, and add its corresponding glyph index to the list as an upgrade.
Also, to interact with this proposed library, a web front end can be made, in PHP, to make it more user friendly than using the command line.
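The lookup and miss-counting idea could look roughly like the following Python sketch. It is an illustration under stated assumptions rather than the planned C implementation: the baseline is held in a dict (a hash table), and the class name `BaselineChecker`, the helper `glyph_names` and the threshold value are all hypothetical.

```python
from collections import Counter

# Hypothetical threshold: how many lookups of an unknown word
# we tolerate before proposing it for the baseline list.
MISS_THRESHOLD = 3


def glyph_names(hb_output):
    """Return the glyph names from '[name=cluster+advance|...]'."""
    inner = hb_output.strip()[1:-1]
    return [item.split("=", 1)[0] for item in inner.split("|") if item]


class BaselineChecker:
    def __init__(self, baseline, miss_threshold=MISS_THRESHOLD):
        self.baseline = baseline      # dict: word -> expected glyph names
        self.misses = Counter()       # lookups of words not in the baseline
        self.miss_threshold = miss_threshold

    def check(self, word, hb_output):
        """Return 'ok', 'mismatch', or 'unknown' for one rendered word."""
        expected = self.baseline.get(word)
        if expected is None:
            self.misses[word] += 1
            return "unknown"
        return "ok" if glyph_names(hb_output) == expected else "mismatch"

    def candidates(self):
        """Unknown words seen often enough to be added to the baseline."""
        return sorted(
            w for w, n in self.misses.items() if n >= self.miss_threshold
        )


if __name__ == "__main__":
    # Baseline entry taken from the hb-shape example in the text.
    checker = BaselineChecker(
        {"മലയാളം": ["m1", "l3", "y1", "uni0D3E", "lh", "uni0D02"]}
    )
    print(checker.check(
        "മലയാളം",
        "[m1=0+1046|l3=1+1462|y1=2+1624|uni0D3E=2+826|lh=4+1134|uni0D02=4+856]",
    ))
```

A trie would serve the same purpose as the dict here; the dict is chosen only because it keeps the sketch short.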
A rough timeline for your progress with phases
- Week 1 - 2 : Learn more about Opentype and Unicode. Learn well how font shapes are usually rendered in engines, how they appear when characters are combined into words, and how the glyph indices change as a result.
- Week 2 - 3 : Create a list of words or characters in Unicode that need to be tested against Harfbuzz output, preferably in Malayalam first. Select the words carefully so that the list is effective as well as concise.
- Week 4 - 5 : Start coding for the application with the collected data as the baseline.
- Week 6 : Test the code on some Harfbuzz Malayalam renderings against the provided list, and make changes accordingly so that it performs correctly and faster.
- Week 7 - 8 : Create the baseline glyph word index for as many Indic languages as possible, although there are time and linguistic barriers. Planning to collect it at least for Hindi.
- Week 9 : Creating the web front end for the application.
- Week 10 : Testing, reviewing and documentation.
Tell us about something you have created
I have created a prototype search engine, using Hadoop in the back end and Python for further ranking, with a web page as the interface.
Have you communicated with a potential mentor? If so who?
Yes, I have communicated with the mentor Rajeesh K Nambiar.
SMC Wiki link of your proposal
Progress
20/07/2013
- Started coding for the project three days ago.
- In the code I am currently developing, the first input needed is a file with the list of words/characters whose rendering is to be tested, along with the correct glyph names of each word/character. These are extracted manually from FontForge at the moment. Eg: ക[k1]
- The second input is a file with the output of Harfbuzz renderings of all the words/characters chosen for testing. A separate script is written for this purpose; it is executed on the test words file and yields output of the form: ക[k1=0+1588]. This is actually the output of the hb-shape command. The value following the = will be ignored for now.
- In the testing script, the first file is opened and the characters appearing before [ are read, i.e. our word/character. Then, until the ] sign is encountered, the strings (eliminating =, + and digits) are added to an array. The same character is looked up in the file of Harfbuzz rendered outputs, and its glyph names are similarly collected in an array.
- Then the two arrays are compared. If both are the same, we enter the value 0 in a check array: check[i] = 0. Otherwise, check[i] = 1.
- The last two steps are repeated until the end of the file is encountered.
- After that, we look up the check array. All the words at positions i with check[i] = 1 will be stored in a separate file.
- Finally, we can run another script on this results file to get the hb-view outputs of these words, to get a better understanding of the rendering mistake.
- Further corrections to the above algorithm will be updated periodically.
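The steps listed above can be sketched in Python as follows. This is only an illustration of the described flow, not the project's actual script. The names `parse_entry` and `run_check` are hypothetical, and multi-glyph entries are assumed to be separated by `|` as in hb-shape output. One deliberate deviation: each glyph item is cut at the `=` sign instead of deleting every digit, so that glyph names such as k1 and k2 remain distinct.

```python
import re

# Matches one entry such as 'ക[k1]' or 'ക[k1=0+1588]':
# the word, followed by a bracketed glyph list.
ENTRY = re.compile(r"^(?P<word>[^\[\]]+)\[(?P<glyphs>[^\]]*)\]$")


def parse_entry(line):
    """Split one entry into (word, list of glyph names)."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError("unrecognised entry: %r" % line)
    # Keep only the part before '=', dropping cluster and advance values.
    names = [g.split("=", 1)[0] for g in match["glyphs"].split("|") if g]
    return match["word"], names


def run_check(expected_lines, rendered_lines):
    """Compare expected glyph names with rendered ones.

    Returns (check, failed): check[i] is 0 when the i-th word matches
    and 1 otherwise; failed lists the words with check[i] == 1.
    Assumption: a word missing from the rendered output counts as a failure.
    """
    rendered = dict(parse_entry(l) for l in rendered_lines if l.strip())
    words, check = [], []
    for line in expected_lines:
        if not line.strip():
            continue
        word, names = parse_entry(line)
        words.append(word)
        check.append(0 if rendered.get(word) == names else 1)
    failed = [w for w, c in zip(words, check) if c == 1]
    return check, failed


if __name__ == "__main__":
    expected = ["ക[k1]"]
    rendered = ["ക[k1=0+1588]"]
    print(run_check(expected, rendered))
```

In practice the two lists of lines would be read from the two input files, and the `failed` list written to the results file that is later passed to hb-view.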