The hOCR Microformat for OCR Workflow and Results

Thomas Breuel
Proceedings of the Ninth International Conference on Document Analysis and Recognition volume 2, Pages 1063-1067, Curitiba, Brazil, IEEE Computer Society, Washington, DC, 2007

Abstract:

Large-scale scanning and document conversion efforts have led to a renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format is defined as a microformat on top of the HTML and CSS standards and therefore can represent a wide range of linguistic and typographic phenomena with already well-defined, widely understood markup and can be processed using widely available and known tools. The format is based on a new, multi-level abstraction of OCR results based on logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for the support of existing workflows and the development of future model-based OCR engines.

Files:

  The hOCR Microformat.pdf

BibTeX:

@inproceedings{ BREU2007,
	Title = {The hOCR Microformat for OCR Workflow and Results},
	Author = {Thomas Breuel},
	BookTitle = {Proceedings of the Ninth International Conference on Document Analysis and Recognition},
	Year = {2007},
	Publisher = {IEEE Computer Society},
	Volume = {2},
	Pages = {1063-1067}
}

     
Last modified: 30.08.2016