User:Hrishikesh.kb/OCR: Difference between revisions

From SMC Wiki

Latest revision as of 10:56, 15 January 2015

Corpus

Notes based on the paper "Building Data Sets for Indian Language OCR Research" by C.V. Jawahar, Anand Kumar, A. Phaneendra, and K.J. Jinesh, IIIT Hyderabad.


Generating a large database of annotated document images involves:

  1. Identifying the content/source.
  2. Employing well-defined, repeatable pre-processing steps to create multiple images suited to various document image analysis (DIA) tasks.
  3. Applying consistent labelling procedures for annotation.
  4. Storing annotation information in a structured form for effective access.

Annotation

  • Labelling image components (often with text)
  • Additional details are also useful, such as
    • Layout information
    • Language/script
    • Scanning parameters
    • Printing parameters

  • Levels of annotation
    • Structural level
    • Functional level
    • Content level

Of these, content-level annotation is critical for OCR.
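
As a rough illustration of how the annotation levels and the extra details above might sit together in one structured record, here is a minimal Python sketch. The schema is an assumption made for these notes, not the format used in the paper: the class names `Component` and `PageAnnotation`, all field names, and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple


@dataclass
class Component:
    """One labelled image region (hypothetical schema)."""
    bbox: Tuple[int, int, int, int]      # (x, y, width, height) in pixels
    structural: str                      # structural level, e.g. "block", "line", "word"
    functional: Optional[str] = None     # functional level, e.g. "title", "figure"
    content: Optional[str] = None        # content level: the ground-truth text, if any


@dataclass
class PageAnnotation:
    """Annotation for one scanned page, with the additional details."""
    image_file: str
    language: str
    script: str
    scan_dpi: int                        # one example of a scanning parameter
    components: List[Component] = field(default_factory=list)

    def ocr_ground_truth(self) -> List[str]:
        """Return only the content-level text, which is what OCR is evaluated against."""
        return [c.content for c in self.components if c.content is not None]


# Sample values below are made up for illustration.
page = PageAnnotation(
    image_file="page_001.png",
    language="Malayalam",
    script="Malayalam",
    scan_dpi=300,
)
page.components.append(Component((40, 30, 500, 60), "line", "title", "Sample title text"))
page.components.append(Component((40, 120, 500, 300), "block", "figure"))

# Structured storage: serialise the whole record to JSON.
record = json.dumps(asdict(page), indent=2)
```

Keeping the three levels as separate fields means a region such as a figure can carry structural and functional labels with no content, while `ocr_ground_truth()` picks out just the text needed for OCR.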