  Optical Character Recognition

Optical Character Recognition (OCR) is the process of converting scanned images of machine-printed or handwritten text (numerals, letters, and symbols) into a computer-processable format such as ASCII.

General Introduction to OCR System

  OCR for Assamese & Manipuri

Technical Description of the Technology:
 
OCR (Optical Character Recognition) is the process by which a computer recognizes printed or written text. The characters in a text are scanned to produce their bitmap representations, which are then analyzed to determine the characters they correspond to. This analysis translates each character image into a character code (typically the corresponding ISFOC or ISCII code), ready for subsequent processing.

An MoU to transfer the OCR technology from the Resource Centre at the Indian Statistical Institute, Kolkata, was signed in August 2002, and the technology was accordingly offered in September 2002. The system recognizes all Assamese characters, including complex structures such as conjuncts. It can convert printed text in the traditional script and the new script, from books and machine-printed pages, into editable text. A page layout module has been developed to handle multicolumn documents and documents containing non-text information.
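As an illustration of the bitmap-to-code translation described above, the following Python sketch matches a scanned glyph bitmap against stored templates and returns the code of the nearest one. It is only a toy under stated assumptions: the 3x3 templates, the numeric codes (placeholders, not real ISFOC or ISCII values), and the names TEMPLATES, hamming_distance, and recognize are all hypothetical. Nearest-template matching is a generic technique and is not necessarily the method used in this OCR system.

```python
# Hypothetical glyph templates: each maps a placeholder character code
# (NOT a real ISFOC/ISCII value) to a tiny 3x3 bitmap, '1' = ink, '0' = paper.
TEMPLATES = {
    0x01: ("010", "010", "010"),  # vertical bar
    0x02: ("111", "101", "111"),  # ring
    0x03: ("111", "010", "010"),  # T shape
}


def hamming_distance(a, b):
    """Count the pixels in which two equally sized bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def recognize(bitmap, templates=TEMPLATES):
    """Return the code whose template is closest to the scanned bitmap."""
    return min(templates, key=lambda code: hamming_distance(bitmap, templates[code]))


# A ring with one pixel lost to scanning noise is still closest to the ring template.
noisy_ring = ("111", "101", "110")
code = recognize(noisy_ring)
print(hex(code))
```

A practical recognizer would work on much larger bitmaps, after the page has been segmented into lines and characters, and would need a richer model to handle conjuncts.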

 


     Snapshot of Assamese OCR GUI

 Features: The features of the Assamese and Manipuri OCR are as follows:

  1.  Documents for input                  : Books and machine-printed documents
  2.  Document printing support            : Laser, offset, inkjet, press printed
  3.  Document type                        : Bond, glossy, photocopy
  4.  Size of paper                        : Any size
  5.  Font size                            : 12 pt to 24 pt
  6.  Font support                         : Normal and bold
  7.  DPI                                  : 300
  8.  Scanned image file format            : TIFF (8-bit grayscale)
  9.  Skew detection & correction          : -5 deg. to +5 deg.
  10. Noise reduction                      : Yes
  11. Column support                       : Single column (multicolumn ready to integrate)
  12. Highlights non-recognized characters : No
  13. Accuracy (character)                 : 97%
  14. Speed of conversion                  : 40 characters per sec. (P4 machine)
  15. Post-processing                      : Morphological analysis
  16. Other support                        : Graphics/images (not yet integrated)
  17. Output                               : ISCII (.PC-ACII)

  18. Required system configuration        : The application can run on computers with the following:

        a. Pentium III
        b. 32 MB RAM
        c. OS: Windows 95/98/NT/2000/XP or Linux

  19. Other devices                        : HP scanners preferred

  20. Availability of documentation        : Ready
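The skew detection and correction range listed in the features (-5 deg. to +5 deg.) can be illustrated with a projection-profile search. The Python sketch below is an example under stated assumptions, not the system's actual algorithm: the function names, the 0.5 degree search step, the synthetic page, and the use of a vertical shear to approximate a small rotation are all choices made here for illustration.

```python
import math


def estimate_skew(points, max_deg=5.0, step_deg=0.5):
    """Return the candidate angle (degrees) that gives the sharpest row histogram.

    points: iterable of (x, y) coordinates of foreground (ink) pixels.
    A positive angle means baselines follow y = y0 + x * tan(angle).
    """
    points = list(points)
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_deg / step_deg))
    for i in range(steps + 1):
        angle = -max_deg + i * step_deg
        slope = math.tan(math.radians(angle))
        rows = {}
        for x, y in points:
            row = round(y - x * slope)  # undo the candidate skew with a vertical shear
            rows[row] = rows.get(row, 0) + 1
        # When text lines become horizontal, their pixels pile into few rows,
        # so the sum of squared row counts peaks at the true skew.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def deskew(points, angle_deg):
    """Correct the skew with a vertical shear (adequate for small angles)."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, round(y - x * slope)) for x, y in points]


def synthetic_page(angle_deg, baselines=(20, 60, 100), width=200):
    """Build a hypothetical test page: one-pixel-thick text lines at a given skew."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]


page = synthetic_page(3.0)
angle = estimate_skew(page)
flat = deskew(page, angle)
print(angle, sorted({y for _, y in flat}))
```

Restricting the search to a narrow window such as plus or minus 5 degrees keeps the number of candidate angles small, which matters on the modest hardware listed in the system configuration.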