Layout Analysis of Urdu Document Images

Faisal Shafait, Adnan-ul-Hasan, Daniel Keysers, Thomas Breuel
10th IEEE International Multi-topic Conference (INMIC 2006), IEEE Computer Society, 12/2006


Layout analysis is a key component of an OCR system. In this paper, we present a layout analysis system for extracting text-lines in reading order from Urdu document images. To this end, we evaluate an existing layout analysis system designed for Roman-script text on Urdu documents, describe its methods, and identify the changes needed to adapt it to Urdu script. The two main changes are: 1) the text-line model for Roman script is modified to fit Urdu text, and 2) the reading order of an Urdu document is defined. The method is applied to a collection of scanned Urdu documents from various books, magazines, and newspapers. The results show high text-line detection accuracy on scanned images of Urdu prose and poetry books and magazines. The algorithm also works reasonably well on newspaper images. Finally, we identify directions for future work that may further improve the accuracy of the system.
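To make the notion of reading order concrete, the following is a minimal sketch in Python, not the algorithm used in the paper. It assumes that text-lines have already been detected as axis-aligned bounding boxes and that the page consists of simple, non-nested columns. Because Urdu is written right to left, the sketch visits columns from right to left and orders lines from top to bottom within each column. The names Box, group_into_columns, and urdu_reading_order are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one text-line, in pixel coordinates."""
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge
    y1: int  # bottom edge


def overlaps_horizontally(a: Box, b: Box) -> bool:
    """True if the horizontal extents of the two boxes intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1


def group_into_columns(lines: List[Box]) -> List[List[Box]]:
    """Greedily group lines whose horizontal extents overlap.

    Assumption: columns are separated by clear vertical gaps, so two
    lines in different columns never overlap horizontally.
    """
    columns: List[List[Box]] = []
    for line in sorted(lines, key=lambda b: b.x0):
        for col in columns:
            if any(overlaps_horizontally(line, other) for other in col):
                col.append(line)
                break
        else:
            # No existing column overlaps this line: start a new one.
            columns.append([line])
    return columns


def urdu_reading_order(lines: List[Box]) -> List[Box]:
    """Order lines column by column, right to left, top to bottom."""
    columns = group_into_columns(lines)
    # Rightmost column comes first in a right-to-left script.
    columns.sort(key=lambda col: max(b.x1 for b in col), reverse=True)
    ordered: List[Box] = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b.y0))
    return ordered


if __name__ == "__main__":
    page = [
        Box(0, 0, 100, 20),     # left column, top line
        Box(120, 30, 220, 50),  # right column, bottom line
        Box(0, 30, 100, 50),    # left column, bottom line
        Box(120, 0, 220, 20),   # right column, top line
    ]
    for box in urdu_reading_order(page):
        print(box)
```

For a Roman-script page, the same sketch would only need the column sort reversed (leftmost column first). Real newspaper layouts with headlines spanning several columns would break the simple column assumption made here and need a more general ordering criterion.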




