Notes based on the paper "Building Data Sets for Indian Language OCR Research" by C.V. Jawahar, Anand Kumar, A. Phaneendra, and K.J. Jinesh, IIIT Hyderabad.

Generating a large database of annotated document images involves:

  1. Identifying the content/source.
  2. Employing well-defined, repeatable pre-processing steps to create multiple images suited to various document image analysis (DIA) tasks.
  3. Following consistent labelling procedures for annotation.
  4. Storing the annotation information in a structured form for effective access.


  • Labelling image components (often with text)
  • Additional details are also useful, such as
    • Layout information
    • Language/Script
    • Scanning parameters
    • Printing parameters
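
As an illustration of the list above, a single annotation record might pair a component label with the additional details and be stored in a structured form. The following is a minimal Python sketch only. The class name ComponentAnnotation, its field names, and the sample values are assumptions made for illustration and are not taken from the paper.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ComponentAnnotation:
    """One labelled image component plus additional details (hypothetical schema)."""
    bbox: tuple    # (x, y, width, height) in pixels: layout information
    text: str      # text label of the component
    script: str    # language/script of the component
    scan_dpi: int  # one example of a scanning parameter
    font: str      # one example of a printing parameter


def to_json(annotation: ComponentAnnotation) -> str:
    """Serialize the record; ensure_ascii=False keeps Indic text readable."""
    return json.dumps(asdict(annotation), ensure_ascii=False)


# Sample values are invented for the example.
record = ComponentAnnotation(
    bbox=(10, 20, 120, 40),
    text="മലയാളം",
    script="Malayalam",
    scan_dpi=300,
    font="unknown",
)
serialized = to_json(record)
```

JSON is used here only because it is easy to inspect; any structured format would serve the same purpose.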

  • Levels of annotation
    • Structural level
    • Functional level
    • Content level

Of these, content-level annotation is critical for OCR.
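
The three levels can be sketched as tags on annotation entries, with the content-level entries selected as OCR ground truth. This is a hedged Python illustration. The names AnnotationLevel and content_annotations are hypothetical, and the sample entries reflect one reading of the levels (structural as physical blocks, functional as logical roles, content as the text itself) rather than definitions quoted from the paper.

```python
from enum import Enum


class AnnotationLevel(Enum):
    STRUCTURAL = "structural"
    FUNCTIONAL = "functional"
    CONTENT = "content"


def content_annotations(annotations):
    """Keep only the values of content-level entries from (level, value) pairs."""
    return [value for level, value in annotations
            if level is AnnotationLevel.CONTENT]


# Invented sample entries, one per level.
sample = [
    (AnnotationLevel.STRUCTURAL, "block 1"),
    (AnnotationLevel.FUNCTIONAL, "title"),
    (AnnotationLevel.CONTENT, "Building Data Sets"),
]
ocr_ground_truth = content_annotations(sample)
```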