A Simple and Effective Approach for Border Noise Removal from Document Images

Faisal Shafait, Thomas Breuel
13th IEEE International Multi-topic Conference, Islamabad, Pakistan, IEEE, 12/2009


When digitizing bound material like books or magazines, marginal noise appears along the page border. This noise consists of undesired text parts from the neighboring page and/or speckles that result from the binarization process. When a keyword-based search is performed in a digitized collection, textual noise in particular poses problems, since the returned search results might correspond to textual noise instead of the actual contents of the page. Manually removing marginal noise from each page is not feasible in large-scale digitization projects. In this paper, we present a simple and effective approach for removing both textual and non-textual noise by finding the borders of noise regions using projection profile analysis. We demonstrate the effectiveness of our approach by evaluating it quantitatively on the widely used University of Washington (UW3) dataset. The results show that our approach reduces the noise ratio from 70% to 20% while retaining more than 99% of the actual page contents. Comparison with state-of-the-art approaches shows that our algorithm performs comparably to them, while being simple to understand and easy to implement. We also provide an open-source implementation of our method as part of the OCRopus OCR system.
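
To make the idea of projection profile analysis concrete, the following is a minimal sketch and not the authors' algorithm, whose details the abstract does not give. It assumes a binary page image stored as a list of rows with 1 for ink, and it assumes that marginal noise is separated from the page contents by a run of empty columns near the left or right edge. The function names and the `margin_percent` parameter are hypothetical, chosen only for illustration.

```python
def projection_profiles(image):
    """Return (row_profile, col_profile): ink-pixel counts per row and per column."""
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile


def widest_gap_cut(profile, lo, hi, max_ink=0):
    """Return the middle index of the widest run of near-empty entries
    in profile[lo:hi], or None if that range contains no such run."""
    best_start, best_len, run_start = None, 0, None
    for i in range(lo, hi + 1):  # the extra step at i == hi closes a trailing run
        is_empty = i < hi and profile[i] <= max_ink
        if is_empty:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_start is None:
        return None
    return best_start + best_len // 2


def content_column_bounds(image, margin_percent=30):
    """Estimate [left, right) column bounds of the page contents by looking
    for the widest empty gap inside each side margin (an assumed heuristic)."""
    _, col_profile = projection_profiles(image)
    width = len(col_profile)
    margin = width * margin_percent // 100
    left = widest_gap_cut(col_profile, 0, margin)
    right = widest_gap_cut(col_profile, width - margin, width)
    return (0 if left is None else left, width if right is None else right)


# Toy page: 2 columns of neighboring-page text, a 4-column gap,
# 10 columns of actual contents, then 4 empty columns.
row = [1, 1, 0, 0, 0, 0] + [1] * 10 + [0] * 4
image = [row[:] for _ in range(6)]
left, right = content_column_bounds(image)
cropped = [r[left:right] for r in image]
```

On the toy page, the two noise columns fall outside the estimated bounds while all ten content columns are kept. The same search could be run on the row profile for the top and bottom borders; real scans would additionally need a tolerance for speckles (`max_ink`) and for skewed pages, which this sketch does not address.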




