User:Hrishikesh.kb/OCR

Corpus
Notes based on the paper "Building Data Sets for Indian Language OCR Research" by C.V. Jawahar, Anand Kumar, A. Phaneendra, and K.J. Jinesh, IIIT Hyderabad.

Generating a large database of annotated document images involves:
 * 1) Identifying the content/source
 * 2) Employing well-defined, repeatable pre-processing steps to create multiple images suited to various document image analysis (DIA) tasks
 * 3) Applying consistent labeling procedures for annotation
 * 4) Storing annotation information in a structured form for effective access

Annotation
Annotation at several levels (structural, functional, content, etc.) is useful; of these, 'content level' annotation is critical for OCR.
 * Labelling image components (often with text)
 * Additional details such as
 ** Layout information
 ** Language/Script
 ** Scanning parameters
 ** Printing parameters
 * Levels of annotation
 ** Structural level
 ** Functional level
 ** Content level
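
The annotation details listed above could be held in a small structured record. The following is only a minimal sketch in Python and is not taken from the paper: the class names (`PageAnnotation`, `ComponentLabel`, `AnnotationLevel`), the field names, and the pixel bounding-box convention are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class AnnotationLevel(str, Enum):
    """The three annotation levels named in the notes."""
    STRUCTURAL = "structural"
    FUNCTIONAL = "functional"
    CONTENT = "content"


@dataclass
class ComponentLabel:
    """One labelled image component (hypothetical layout)."""
    level: AnnotationLevel
    bbox: tuple          # assumed convention: (x, y, width, height) in pixels
    label: str           # e.g. "line", "word", "heading"
    text: str = ""       # transcription, mainly for content-level labels


@dataclass
class PageAnnotation:
    """Annotation for one document image, plus the additional details."""
    image_file: str
    script: str                                     # language/script
    scanning: dict = field(default_factory=dict)    # scanning parameters
    printing: dict = field(default_factory=dict)    # printing parameters
    components: list = field(default_factory=list)  # ComponentLabel items

    def at_level(self, level):
        """Return only the components annotated at the given level."""
        return [c for c in self.components if c.level == level]

    def to_json(self):
        """Serialize the whole record for structured storage."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

Under these assumptions, a page record is filled with `ComponentLabel` entries and `at_level(AnnotationLevel.CONTENT)` selects the content-level labels that matter most for OCR. JSON is used here only as one convenient storage format; the notes do not say which format the paper uses.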