User:Hrishikesh.kb/OCR

From SMC Wiki

Corpus

Notes based on the paper "Building Data Sets for Indian Language OCR Research" by C.V. Jawahar, Anand Kumar, A. Phaneendra, and K.J. Jinesh, IIIT Hyderabad.


Generating a large database of annotated document images involves:

  1. Identifying the content/source.
  2. Employing well-defined, repeatable pre-processing steps to create multiple images suited to various document image analysis (DIA) tasks.
  3. Following consistent labeling procedures for annotation.
  4. Storing annotation information in a structured form for effective access.
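
As a rough illustration of step 2, the sketch below derives several image variants from one scanned page using a fixed, named set of steps, so the same input always gives the same outputs. The step names (`gray`, `binary`, `half_res`), the threshold of 128, and the tiny 4x4 "page" are assumptions made for this example and are not taken from the paper; a real pipeline would use an image library rather than nested lists.

```python
from typing import Callable, Dict, List

# A grayscale image as rows of pixel values, 0 (black) to 255 (white).
Image = List[List[int]]


def binarize(img: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black or pure white using a fixed threshold."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def downsample(img: Image, factor: int = 2) -> Image:
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in img[::factor]]


# Hypothetical named steps; fixed parameters keep the process repeatable.
PIPELINE: Dict[str, Callable[[Image], Image]] = {
    "gray": lambda img: [row[:] for row in img],  # unchanged copy
    "binary": binarize,
    "half_res": downsample,
}


def make_variants(img: Image) -> Dict[str, Image]:
    """Apply every named step to the same source image."""
    return {name: step(img) for name, step in PIPELINE.items()}


# Toy stand-in for a scanned page.
page = [
    [10, 200, 30, 220],
    [240, 20, 250, 15],
    [5, 5, 255, 255],
    [128, 127, 0, 255],
]
variants = make_variants(page)
```

Keeping the steps in one named table means the variant name can be stored alongside the annotation, which records how each image was produced.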

Annotation

  • Labelling image components (often with text)
  • Additional details such as
    • Layout information
    • Language/Script
    • Scanning parameters
    • Printing parameters

etc. are useful.

  • Levels of annotation
    • Structural level
    • Functional level
    • Content level

Of these, content-level annotation is critical for OCR.
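
One possible way to picture these notes as a stored record is sketched below. It assumes a reading of the three levels in which the structural level is the physical unit (line, word), the functional level is the logical role (title, paragraph), and the content level is the ground-truth text. The class names, field names, file name, and sample values are all hypothetical; the paper's actual storage schema is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class Component:
    """One labelled image component (hypothetical layout)."""
    bbox: Tuple[int, int, int, int]  # x, y, width, height in pixels
    structural: str                  # physical unit, e.g. "line" or "word"
    functional: str                  # logical role, e.g. "title"
    content: str                     # ground-truth text used for OCR


@dataclass
class PageAnnotation:
    """Annotation for one page image plus the additional details."""
    image_file: str
    language: str
    script: str
    scan_dpi: int                    # an example scanning parameter
    components: List[Component] = field(default_factory=list)

    def ground_truth(self) -> str:
        """Join the content-level text of all components, one per line."""
        return "\n".join(c.content for c in self.components)


# Made-up sample values for illustration only.
page = PageAnnotation(
    image_file="page_0001.png",
    language="Malayalam",
    script="Malayalam",
    scan_dpi=300,
)
page.components.append(Component((40, 30, 520, 48), "line", "title", "മലയാളം"))
page.components.append(Component((40, 100, 520, 40), "line", "paragraph", "ഒരു വരി"))

# Structured storage: serialize to JSON and read it back.
record = json.dumps(asdict(page), ensure_ascii=False, indent=2)
restored = json.loads(record)
```

The `ground_truth` method reflects why the content level matters for OCR: it is the text that recognizer output would be compared against.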