Recognizable Units in Pashto Language for OCR

Riaz Ahmad, Muhammad Zeshan Afzal, Sheikh Faisal Rashid, Marcus Eichenberger-Liwicki, Andreas Dengel, Thomas Breuel
In: IEEE Xplore (ed.) MOCR, Nancy, France, other, 8/2015

Abstract:

Atomic segmentation of cursive scripts into constituent characters is one of the most challenging problems in pattern recognition. To avoid segmentation in cursive script, concrete shapes are considered as recognizable units. The objective of this work is therefore to identify alternative recognizable units in Pashto cursive script. These alternatives are ligatures and primary ligatures. However, a sound statistical analysis is needed to determine the appropriate numbers of ligatures and primary ligatures in Pashto script. In this work, a corpus of 2,313,736 Pashto words is extracted from large-scale, diverse web sources, and a total of 19,268 unique ligatures are identified in Pashto cursive script. The analysis shows that only 7,000 ligatures account for 91% of the overall corpus of unique Pashto words. Similarly, about 7,681 primary ligatures are identified, which represent the basic shapes of all the ligatures.
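
The kind of ligature statistics summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline, and it rests on stated assumptions: a ligature is approximated as a run of letters that ends at a letter which does not join to the following one, the `NON_JOINERS` set is a small illustrative sample rather than the full Pashto inventory, and coverage is measured over ligature occurrences. The function names are hypothetical.

```python
from collections import Counter

# ASSUMPTION: illustrative, incomplete sample of letters that do not connect
# to the following letter. A real analysis would need the full Pashto set.
NON_JOINERS = frozenset("ادذرزو")


def split_ligatures(word, non_joiners=NON_JOINERS):
    """Split a word into connected runs, breaking after each non-joining letter."""
    ligatures, current = [], ""
    for ch in word:
        current += ch
        if ch in non_joiners:
            # A non-joining letter closes the current connected run.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


def ligature_counts(words, non_joiners=NON_JOINERS):
    """Count how often each ligature occurs across an iterable of words."""
    counts = Counter()
    for word in words:
        counts.update(split_ligatures(word, non_joiners))
    return counts


def coverage(counts, k):
    """Fraction of all ligature occurrences covered by the k most frequent ligatures."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total
```

Calling `coverage(ligature_counts(words), 7000)` on a word list would give a figure of the same kind as the 91% reported above, though the exact definition used in the paper may differ. This sketch also ignores diacritics and does not derive primary ligatures.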

BibTeX:

@inproceedings{ AHMA2015,
	Title = {Recognizable Units in Pashto Language for OCR},
	Author = {Riaz Ahmad and Muhammad Zeshan Afzal and Sheikh Faisal Rashid and Marcus Eichenberger-Liwicki and Andreas Dengel and Thomas Breuel},
	Editor = {IEEE Xplore},
	BookTitle = {MOCR},
	Month = {8},
	Year = {2015},
	Publisher = {other}
}

     
Last modified: 30.08.2016