Tesseract-ocr

{{Infobox_tool
|purpose=Open source OCR engine, accepting uncompressed TIFF files as input
|image=[[File:Tesseract.png]]
|homepage=http://code.google.com/p/tesseract-ocr/
|license=Apache 2.0 License, except tesseractTrainer.py, which is licensed under the GPL
|platforms=
}}

[[Category:OCR]]
[[Category:Quality Assurance]]


= Description =
Tesseract is probably the most accurate open source OCR engine available. Combined with the [http://leptonica.com/ Leptonica Image Processing Library], it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top three engines in the 1995 UNLV Accuracy test. It saw little development between 1995 and 2006, but Google has improved it extensively since then.

= User Experiences =
Tesseract was applied in an AQuA Mashup that resulted in the solution page [http://wiki.opf-labs.org/display/AQuA/Compare+OCR+results+of+the+same+source+material+in+different+formats+%28TIFF%2C+JP2%29 Compare OCR results of the same source material in different formats (TIFF, JP2)].

= Development Activity =

{{Infobox_tool_details
|ohloh_id=tesseract-ocr
}}
@@ Line 1: / Line 1: @@
-{{Infobox tool
+{{Infobox_tool
-|image=Tesseract.png
 |purpose=Open source OCR engine, accepting uncompressed TIFF files as input
-|homepage=https://github.com/tesseract-ocr/tesseract
+|image=[[[File:Tesseract.png]]]
+|homepage=http://code.google.com/p/tesseract-ocr/
 |license=Apache 2.0 License EXCEPT the tesseractTrainer.py, which is licensed with GPL
-|platforms=Linux, Windows, MacOSX
+|platforms=
-|formats_out=ALTO (Analyzed Layout and Text Object)
-|function=OCR
-|content=Image
 }}
-{{Infobox tool details
-|ohloh_id=tesseract-ocr
-}}
-= Description =
-Tesseract is probably the most accurate open source OCR engine available. Combined with the [http://leptonica.com/ Leptonica Image Processing Library] it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google.
-== Provider ==
+<!-- Delete the Categories that do not apply -->
-Development of Tesseract is sponsored by Google. Its chief developer is [http://research.google.com/pubs/author4479.html Ray Smith].
+[[Category:OCR]]
+[[Category:Quality Assurance]]
-==Licensing and cost==
-Tesseract is an Open Source OCR engine, available under the [http://www.apache.org/licenses/LICENSE-2.0 Apache 2.0 license]. It can be used directly, or (for programmers) using an [https://github.com/tesseract-ocr/tesseract/blob/master/include/tesseract/baseapi.h API].
-==History==
+= Description =
-It was initially developed at HP during a 10 year period from 1984 to 1994. After a decade of minimal development it was released in 2005 for open source. Google acquired Tesseract in 2006 and currently maintains its development.
+Tesseract is probably the most accurate open source OCR engine available. Combined with the [http://leptonica.com/ Leptonica Image Processing Library] it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google.
-==Platform and interoperability==
-*The latest downloads for Linux and Windows are found on [https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkpGMGV2eE0&usp=sharing GoogleDrive]. Older versions of Tesseract and its language packs are found on the discontinued [https://code.google.com/p/tesseract-ocr/downloads/list Google Code download page].
-*The easiest way to install Tesseract on Mac OSX is with [https://www.macports.org/ MacPorts]. Once it is installed, you can install Tesseract by running the command ''sudo port install tesseract'', and any language with ''sudo port install tesseract-<langcode>''. A list of available langcodes can be found on the [https://www.macports.org/ports.php?by=name&substr=tesseract- MacPorts Tesseract page].
-*Dependencies for running Tesseract include Autotools and [http://www.leptonica.org/ Leptonica]. The Windows version requires installation of [http://msdn.microsoft.com/en-us/vstudio/aa718325.aspx Visual Studio]. More information about required Ubuntu libraries and links to specific requirements are on the [https://code.google.com/p/tesseract-ocr/wiki/Compiling Tesseract Wiki].
-*Other programs such as Scan Tailor, unpaper, ImageJ, Gimp or ImageMagick may be needed to properly prepare images for use in Tesseract.
-==Functional notes==
-===Input supported===
-Any image readable by Leptonica is supported in Tesseract including BMP, PNM, PNG, JFIF, JPEG, and TIFF.
-GIF is not supported [http://www.leptonica.com/library-overview.html http://www.leptonica.com/library-overview.html].
-===Output generated===
-plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV and ALTO (XML)
-==Documentation and support==
-*Smith, Ray (2007). An Overview of the Tesseract OCR Engine [http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/33418.pdf http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/33418.pdf ]
-*Installation information is found on the [https://code.google.com/p/tesseract-ocr/wiki/ReadMe ReadMe] page of the project site.
-*Support is offered and issues are addressed on the [https://code.google.com/p/tesseract-ocr/issues/list Issues] page of the project site.
-==Included in==
-*Integration with the free [http://www.dcc.ac.uk/resources/external/xena-software-0 Xena-Digital Preservation Software][http://sourceforge.net/projects/xena/?source=navbar http://sourceforge.net/projects/xena/?source=navbar]
-*Integration with Free Online OCR [http://www.free-ocr.com/faq.html http://www.free-ocr.com/faq.html]
-==Usability==
-*Tesseract was primarily developed for English OCR capability, but 47 language packs have been developed for use with other languages [https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html]. Tesseract 2.0x and 3.0x are [https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 trainable] for other languages.
-*There is no built-in GUI, but there are several available from the [https://code.google.com/p/tesseract-ocr/wiki/3rdParty 3rdParty] page.
 = User Experiences =
-*Lazorchak, Butch. (2014). Making Scanned Content Accessible Using Full-text Search and OCR [http://blogs.loc.gov/digitalpreservation/2014/08/making-scanned-content-accessible-using-full-text-search-and-ocr/ http://blogs.loc.gov/digitalpreservation/2014/08/making-scanned-content-accessible-using-full-text-search-and-ocr/]
+Applied in an AQuA Mashup that resulted in the Solution page: [http://wiki.opf-labs.org/display/AQuA/Compare+OCR+results+of+the+same+source+material+in+different+formats+%28TIFF%2C+JP2%29 Compare OCR results of the same source material in different formats (TIFF, JP2)]
-*Texas A&M University. (2012-). Early Modern OCR Project Workflow [http://emop.tamu.edu/about http://emop.tamu.edu/about]
-*Adams, Chris. (2014). Content Search on a Budget-using Tesseract on large TIFF files [http://chris.improbable.org/2014/3/17/content-search-on-a-budget/ http://chris.improbable.org/2014/3/17/content-search-on-a-budget/]
-*PSNC Digital Libraries Team. (2011). Tesseract 3.0 installation on Ubuntu 10.10 server [http://dl.psnc.pl/2011/01/24/tesseract-3-0-installation-on-ubuntu-10-10-server/ http://dl.psnc.pl/2011/01/24/tesseract-3-0-installation-on-ubuntu-10-10-server/]
-*Lacy, David. (2014). Digital Library upgrade provides enhanced discovery [http://blog.library.villanova.edu/digitallibrary/2014/02/18/digital-library-upgrade-provides-enhanced-discovery/#sthash.OwCtLlEc.dpuf http://blog.library.villanova.edu/digitallibrary/2014/02/18/digital-library-upgrade-provides-enhanced-discovery/#sthash.OwCtLlEc.dpuf]
-*Applied in an AQuA Mashup that resulted in the Solution page: [http://wiki.opf-labs.org/display/AQuA/Compare+OCR+results+of+the+same+source+material+in+different+formats+%28TIFF%2C+JP2%29 Compare OCR results of the same source material in different formats (TIFF, JP2)]
 = Development Activity =
-Source Commits : [https://github.com/tesseract-ocr/tesseract/commits https://github.com/tesseract-ocr/tesseract/commits]
-Issues : https://github.com/tesseract-ocr/tesseract/issues
+{{Infobox_tool_details
+|ohloh_id=tesseract-ocr
+}}