Page Frame Detection for Marginal Noise Removal from Scanned Documents

Faisal Shafait, Joost van Beusekom, Daniel Keysers, Thomas Breuel
Proceedings of the 15th SCIA 2007, Lecture Notes in Computer Science, volume 4522, pages 651-660, Aalborg, Denmark, Springer, June 2007


We describe and evaluate a method to robustly detect the page frame in document images, locating the actual page contents area and removing textual and non-textual noise along the page borders. We use a geometric matching algorithm to find the optimal page frame, which has the advantages of not assuming the existence of whitespace between noisy borders and actual page contents, and of giving a practical solution to the page frame detection problem without the need for parameter tuning. We define suitable performance measures and evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. In addition, we demonstrate that the use of page frame detection reduces the OCR error rate by removing textual noise. Experiments using a commercial OCR system show that the error rate due to elements outside the page frame is reduced from 4.3% to 1.7% on the UW-III dataset.



@inproceedings{ SHAF2007,
	Title = {Page Frame Detection for Marginal Noise Removal from Scanned Documents},
	Author = {Faisal Shafait and Joost van Beusekom and Daniel Keysers and Thomas Breuel},
	BookTitle = {Proceedings of 15th SCIA 2007},
	Month = {6},
	Year = {2007},
	Series = {Lecture Notes in Computer Science},
	Publisher = {Springer},
	Volume = {4522},
	Pages = {651--660},
	Address = {Aalborg, Denmark}
}

Last modified: 30.08.2016