OCRopus is a free document analysis and OCR system released under the Apache License, Version 2.0. Its design is highly modular: components are implemented as plugins, so they can be swapped out easily. OCRopus is currently developed under the lead of Thomas Breuel of the German Research Centre for Artificial Intelligence in Kaiserslautern, Germany, and is sponsored by Google. It is developed for Linux; however, users have reported success running it on Mac OS X.
Currently, OCRopus uses Tesseract as its only character recognition plugin, but others are expected to be added in the future. The plugin approach is especially useful for expanding functionality to include additional languages and writing systems. OCRopus also contains code for a handwriting recognition engine, which is currently disabled and may be repaired in the future.
OCRopus itself does image preprocessing and layout analysis; it chops up the scanned document before passing it to Tesseract for line-by-line or character-by-character recognition.
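The division of labour described above can be illustrated with a minimal Python sketch. It is not OCRopus code and does not use the OCRopus or Tesseract APIs: the names `binarize`, `segment_lines`, `run_pipeline`, and `DummyRecognizer` are hypothetical, the page is assumed to be a grayscale image given as a list of pixel rows, and the layout analysis is reduced to splitting the page at blank rows. The recognizer is passed in as an object so that it can be replaced, in the same spirit as a recognition plugin.

```python
from typing import List, Protocol

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white


class LineRecognizer(Protocol):
    """Hypothetical plugin interface: anything that turns a line image into text."""

    def recognize_line(self, line_image: Image) -> str: ...


def binarize(page: Image, threshold: int = 128) -> Image:
    """Preprocessing step: mark dark pixels as ink (1) and light pixels as background (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in page]


def segment_lines(binary: Image) -> List[Image]:
    """Very simplified layout analysis: a text line is a run of rows that contain ink."""
    lines: List[Image] = []
    current: Image = []
    for row in binary:
        if any(row):
            current.append(row)
        elif current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


class DummyRecognizer:
    """Stand-in for a real engine such as Tesseract; it only reports the line height."""

    def recognize_line(self, line_image: Image) -> str:
        return f"<line of {len(line_image)} rows>"


def run_pipeline(page: Image, recognizer: LineRecognizer) -> List[str]:
    """Preprocess, split into lines, then hand each line to the recognizer."""
    binary = binarize(page)
    return [recognizer.recognize_line(line) for line in segment_lines(binary)]
```

Because `run_pipeline` depends only on the `recognize_line` method, a different engine could be substituted by passing another object with the same method, without touching the preprocessing or segmentation steps.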
As of the alpha release, OCRopus uses the language modeling code from another Google-supported project, OpenFST.