From SMC Wiki
Revision as of 10:56, 15 January 2015 by Hrishikesh.kb (added category)


Notes based on the paper "Building Data Sets for Indian Language OCR Research" by C.V. Jawahar, Anand Kumar, A. Phaneendra, and K.J. Jinesh, IIIT Hyderabad.

Generating a large database of annotated document images involves:

  1. Identifying the content/source;
  2. Employing well-defined, repeatable pre-processing steps to create multiple images suited to various document image analysis (DIA) tasks;
  3. Following consistent labeling procedures for annotation;
  4. Storing annotation information in a structured form for effective access.


  • Labelling image components (often with text)
  • Additional details are also useful, such as
    • Layout information
    • Language/Script
    • Scanning parameters
    • Printing parameters
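
To make the idea of labelled components plus additional details concrete, here is a minimal Python sketch of one possible annotation record stored as JSON. The paper's actual storage format is not described in these notes, so every class name, field name, and value below (`PageAnnotation`, `scan_dpi`, `font_name`, the file name, and so on) is a hypothetical illustration only.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ComponentAnnotation:
    """One labelled image component (hypothetical field names)."""
    bbox: tuple  # (x, y, width, height) in pixels
    text: str    # text label attached to the component


@dataclass
class PageAnnotation:
    """Page-level details kept alongside the labelled components (all assumed)."""
    image_file: str
    layout: str      # layout information
    language: str    # language of the page
    script: str      # script of the page
    scan_dpi: int    # a scanning parameter
    font_name: str   # a printing parameter
    components: list = field(default_factory=list)


# Illustrative values only; none of them come from the paper.
page = PageAnnotation(
    image_file="page_001.png",
    layout="single-column",
    language="Malayalam",
    script="Malayalam",
    scan_dpi=300,
    font_name="unknown",
)
page.components.append(ComponentAnnotation(bbox=(120, 80, 64, 32), text="മലയാളം"))

# Serialize to JSON so the annotation is stored in a structured, queryable form.
record = json.dumps(asdict(page), ensure_ascii=False, indent=2)
```

Keeping the scanning and printing parameters in the same record as the labels means a later experiment can select pages by, say, resolution or script without reopening the images.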

  • Levels of annotation
    • Structural level
    • Functional level
    • Content level

Of these, content-level annotation is critical for OCR.
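
The following Python sketch shows one way the three annotation levels could be tagged and how the content-level entries might be pulled out as OCR ground truth. These notes do not define the levels, so the reading used here is an assumption: structural is taken to mean the physical unit, functional the logical role, and content the text itself. The names `Level`, `Annotation`, and `ocr_ground_truth` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    STRUCTURAL = "structural"  # assumed: physical unit such as block, line, word
    FUNCTIONAL = "functional"  # assumed: logical role such as title or paragraph
    CONTENT = "content"        # assumed: the text contained in the region


@dataclass
class Annotation:
    level: Level
    bbox: tuple  # (x, y, width, height) in pixels
    value: str   # unit type, role name, or text, depending on the level


def ocr_ground_truth(annotations):
    """Keep only content-level entries as (bbox, text) pairs."""
    return [(a.bbox, a.value) for a in annotations if a.level is Level.CONTENT]


# The same region annotated at all three levels (illustrative values).
annotations = [
    Annotation(Level.STRUCTURAL, (0, 0, 800, 40), "line"),
    Annotation(Level.FUNCTIONAL, (0, 0, 800, 40), "title"),
    Annotation(Level.CONTENT, (0, 0, 800, 40), "OCR"),
]

pairs = ocr_ground_truth(annotations)
```

Only the content-level entry survives the filter, which reflects why that level matters most when training or evaluating a recognizer.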