User:Hrishikesh.kb/OCR

From SMC Wiki
Revision as of 10:34, 15 January 2015 by Hrishikesh.kb

Corpus

Notes based on the paper "Building Data Sets for Indian Language OCR Research" by C.V. Jawahar, Anand Kumar, A. Phaneendra, and K.J. Jinesh (IIIT Hyderabad).


Generating a large database of annotated document images involves:

  1. Identifying the content/source.
  2. Employing well-defined, repeatable pre-processing steps to create multiple images suited to various document image analysis (DIA) tasks.
  3. Following consistent labelling procedures for annotation.
  4. Storing annotation information in a structured form for effective access.
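
As an illustration of step 4, here is a minimal sketch of structured storage with keyed access. It is not the storage scheme used in the paper: the JSON Lines layout, the function names `save_annotations` and `load_index`, and the `image_id` field are all assumptions made for this example.

```python
import json
from collections import defaultdict


def save_annotations(records, path):
    """Write one JSON object per line so records can be streamed back later."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps Indic-script text readable in the file
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_index(path, key="image_id"):
    """Group stored records by a chosen field (hypothetical 'image_id' by default)."""
    index = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            index[rec[key]].append(rec)
    return dict(index)
```

Calling `load_index(path)` after `save_annotations(records, path)` gives a dictionary from image identifier to the list of its annotation records, so all annotations for one page can be fetched without scanning the whole file again.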

Annotation

  • Labelling image components (often with text)
  • Additional details are also useful, such as
    • Layout information
    • Language/Script
    • Scanning parameters
    • Printing parameters

  • Levels of annotation
    • Structural level
    • Functional level
    • Content level

Of these, content-level annotation is critical for OCR.
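
The annotation fields and levels listed above could be represented roughly as follows. This is only a sketch: the class names `Level` and `Annotation`, the bounding-box convention, and the helper `ocr_ground_truth` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Tuple


class Level(Enum):
    """The three annotation levels named in the notes."""
    STRUCTURAL = "structural"
    FUNCTIONAL = "functional"
    CONTENT = "content"


@dataclass
class Annotation:
    bbox: Tuple[int, int, int, int]  # assumed convention: (x, y, width, height) in pixels
    level: Level
    label: str  # for content-level annotations, the text of the component
    # Additional details: layout, language/script, scanning and printing parameters
    metadata: Dict[str, str] = field(default_factory=dict)


def ocr_ground_truth(annotations):
    """Keep only content-level annotations, the ones that matter for OCR."""
    return [(a.bbox, a.label) for a in annotations if a.level is Level.CONTENT]
```

For example, given one functional-level annotation and one content-level annotation of the same region, `ocr_ground_truth` returns only the bounding box and text of the content-level one.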