User:Hrishikesh.kb/OCR
Corpus
Notes based on the paper "Building Data Sets for Indian Language OCR Research" by C.V. Jawahar, Anand Kumar, A. Phaneendra, and K.J. Jinesh, IIIT Hyderabad.
Generating a large database of annotated document images involves:
- Identification of the content/source
- Employing well-defined, repeatable pre-processing steps to create multiple images suited to various document image analysis (DIA) tasks
- Consistent labelling procedures for annotation
- Structured storage of annotation information for effective access
Annotation
- Labelling image components (often with text)
- Additional details are also useful, such as:
  - Layout information
  - Language/Script
  - Scanning parameters
  - Printing parameters
- Levels of annotation
  - Structural level
  - Functional level
  - Content level
Of these, content-level annotation is critical for OCR.
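As a rough illustration of what "structured storage of annotation information" could look like, here is a minimal Python sketch. It is not the format used in the paper: the class names, field names, bounding-box convention, and file name are all hypothetical, chosen only to show the three annotation levels and the additional details (language/script, scanning parameters) stored together in one record.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum


class Level(str, Enum):
    """The three annotation levels listed in the notes."""
    STRUCTURAL = "structural"
    FUNCTIONAL = "functional"
    CONTENT = "content"


@dataclass
class Annotation:
    level: Level
    bbox: tuple  # hypothetical convention: (x, y, width, height) in pixels
    label: str   # block type, functional role, or ground-truth text


@dataclass
class PageRecord:
    # All field names here are assumptions made for illustration.
    image_file: str
    language: str
    script: str
    scan_dpi: int
    annotations: list = field(default_factory=list)

    def content_text(self):
        """Return only the content-level labels, i.e. the text needed for OCR."""
        return [a.label for a in self.annotations if a.level is Level.CONTENT]


page = PageRecord("page_0001.png", "Malayalam", "Malayalam", 300)
box = (40, 60, 900, 120)
page.annotations.append(Annotation(Level.STRUCTURAL, box, "text-line"))
page.annotations.append(Annotation(Level.FUNCTIONAL, box, "title"))
page.annotations.append(Annotation(Level.CONTENT, box, "മലയാളം"))

# Serialize the whole record so it can be stored and queried later.
serialized = json.dumps(asdict(page), ensure_ascii=False)
print(page.content_text())
```

Keeping the level as an explicit field on each annotation makes it straightforward to pull out only the content-level ground truth when preparing OCR training or evaluation data.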