User:Hrishikesh.kb/OCR: Difference between revisions
From SMC Wiki
Latest revision as of 10:56, 15 January 2015
Corpus
Notes based on the paper Building Data Sets for Indian Language OCR Research by C.V. Jawahar, Anand Kumar, A. Phaneendra, and K.J. Jinesh, IIIT Hyderabad.
Generating a large database of annotated document images involves:
- Identifying the content/source
- Employing well-defined, repeatable pre-processing steps to create multiple images suited to various document image analysis (DIA) tasks
- Following consistent labelling procedures for annotation
- Storing annotation information in a structured form for effective access
Annotation
- Labelling image components (often with text)
- Additional details are also useful, for example:
  - Layout information
  - Language/script
  - Scanning parameters
  - Printing parameters
- Levels of annotation
  - Structural level
  - Functional level
  - Content level
Of these, content-level annotation is critical for OCR.
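
As a rough illustration of the notes above, one way to store such annotations in a structured form could look like the following Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the class names `Component` and `PageAnnotation`, the field names, the JSON layout, and the reading of the three levels (structural as the kind of region, functional as its role on the page, content as the text it contains).

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Component:
    """One labelled image component (hypothetical schema)."""
    bbox: List[int]                   # [x, y, width, height] in pixels
    structural: str                   # assumed meaning: region kind, e.g. "line" or "word"
    functional: Optional[str] = None  # assumed meaning: role on the page, e.g. "title"
    content: Optional[str] = None     # assumed meaning: the Unicode text of the region


@dataclass
class PageAnnotation:
    """Annotation for one scanned page plus the additional details."""
    image_file: str
    script: str                       # language/script of the page
    scan_dpi: int                     # one example of a scanning parameter
    components: List[Component] = field(default_factory=list)

    def ocr_ground_truth(self) -> List[str]:
        # Content-level labels are what an OCR system is trained and scored against.
        return [c.content for c in self.components if c.content is not None]

    def to_json(self) -> str:
        # ensure_ascii=False keeps Indic-script text readable in the stored file.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


page = PageAnnotation(image_file="page_001.png", script="Malayalam", scan_dpi=300)
page.components.append(
    Component(bbox=[40, 60, 500, 32], structural="line",
              functional="title", content="മലയാളം")
)
page.components.append(Component(bbox=[40, 120, 500, 300], structural="block"))
print(page.to_json())
```

Keeping the three levels as separate optional fields lets a component be labelled structurally first and given content-level text later, which is one possible way to support several DIA tasks from the same record.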