R&D -> Corpus

Corpora for the two allocated languages have been created.

Assamese Corpora:
A corpus comprising text from seven Assamese novels has been created. Its statistics are given in the table below.

Number of words


Fonts used

AS-TTDurga (a C-DAC font) and Geetanjalilight (a popular font used in DTP work).

Encoding Standard


Manipuri Corpora:
The Manipuri corpus was received from the Ministry of Information Technology, Govt. of India. Its creation began at Manipur University under the aegis of Prof. M.S. Ningomba and Dr. N. Pramodini. It contains around 3 million words. The fonts used in creating this corpus are BN-TTDurga, BN-TTBidisha and AS-TTBidisha. The existing corpus has been further enhanced with an additional 60,000 words. Further investigation to make it compatible with existing systems is in progress.