Tesseract-ocr

{{Infobox_tool
|purpose=Open source OCR engine, accepting uncompressed TIFF files as input
|image=Tesseract.png
|homepage=http://code.google.com/p/tesseract-ocr/
|license=Apache 2.0 License EXCEPT the tesseractTrainer.py, which is licensed with GPL
|platforms=Linux, Windows, Mac OS X
}}

[[Category:OCR]]
[[Category:Quality Assurance]]


= Description =
Tesseract is probably the most accurate open source OCR engine available. Combined with the [http://leptonica.com/ Leptonica Image Processing Library], it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top three engines in the 1995 UNLV Accuracy test. It saw little development between 1995 and 2006, but has since been improved extensively by Google.

== Provider ==
Development of Tesseract is sponsored by Google. Its chief developer is [http://research.google.com/pubs/author4479.html Ray Smith].

==Licensing and cost==
Tesseract is an open source OCR engine, available under the [http://www.apache.org/licenses/LICENSE-2.0 Apache 2.0 license]. It can be used directly from the command line, or (for programmers) through an [http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.h API].

==History==
Tesseract was initially developed at HP over a 10-year period from 1984 to 1994. After a decade of minimal development, it was released as open source in 2005. Google acquired Tesseract in 2006 and currently maintains its development.

==Platform and interoperability==
*The latest downloads for Linux and Windows can be found on [https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkpGMGV2eE0&usp=sharing Google Drive]. Older versions of Tesseract and its language packs can be found on the discontinued [https://code.google.com/p/tesseract-ocr/downloads/list Google Code download page].
*The easiest way to install Tesseract on Mac OS X is with [https://www.macports.org/ MacPorts]. Once MacPorts is installed, you can install Tesseract by running the command ''sudo port install tesseract'', and any language pack with ''sudo port install tesseract-<langcode>''. A list of available langcodes can be found on the [https://www.macports.org/ports.php?by=name&substr=tesseract- MacPorts Tesseract page].
*Dependencies for running Tesseract on Linux include Autotools and [http://www.leptonica.org/ Leptonica]. The Windows version requires installation of [http://msdn.microsoft.com/en-us/vstudio/aa718325.aspx Visual Studio]. More information about the required Ubuntu libraries, with links to specific requirements, is on the [https://code.google.com/p/tesseract-ocr/wiki/Compiling Tesseract Wiki].
*Other programs such as Scan Tailor, unpaper, ImageJ, GIMP or ImageMagick may be needed to prepare images properly for use in Tesseract.
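
The MacPorts steps above can be sketched as a small Python helper that builds the install commands for Tesseract and a set of language packs. This is only an illustration: the function name is hypothetical, and the langcodes ''deu'' and ''fra'' are example values that should be checked against the MacPorts Tesseract page before use.

```python
def macports_install_commands(langcodes):
    """Return the MacPorts commands that install Tesseract and language packs.

    Each command is a list of arguments, following the pattern
    'sudo port install tesseract' and 'sudo port install tesseract-<langcode>'.
    """
    commands = [["sudo", "port", "install", "tesseract"]]
    for code in langcodes:
        # One port per language pack, named tesseract-<langcode>.
        commands.append(["sudo", "port", "install", "tesseract-" + code])
    return commands


if __name__ == "__main__":
    # 'deu' and 'fra' are example langcodes (an assumption for illustration).
    for command in macports_install_commands(["deu", "fra"]):
        print(" ".join(command))
```

The helper only prints the commands; running them is left to the user, since they require administrator rights.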

==Functional notes==
===Input supported===
Tesseract supports any image format readable by Leptonica, including BMP, PNM, PNG, JFIF, JPEG and TIFF. GIF is not supported (see the [http://www.leptonica.com/library-overview.html Leptonica library overview]).
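
As a rough illustration of the format list above, the following Python sketch checks a file name against those formats before it is passed to Tesseract. The mapping from formats to file extensions is an assumption made for this example, and the function name is hypothetical; the actual check is done by Leptonica on the file contents, not the extension.

```python
import os

# File extensions assumed for the formats listed above (BMP, PNM, PNG,
# JFIF, JPEG, TIFF). This mapping is illustrative, not taken from Leptonica.
SUPPORTED_EXTENSIONS = {
    ".bmp", ".pnm", ".png", ".jfif", ".jpg", ".jpeg", ".tif", ".tiff",
}


def is_supported_input(path):
    """Return True if the file extension matches one of the listed formats."""
    extension = os.path.splitext(path)[1].lower()
    return extension in SUPPORTED_EXTENSIONS
```

For example, ''is_supported_input("page.gif")'' returns False, so such a file would first need converting with one of the image tools mentioned under Platform and interoperability.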
===Output generated===
Tesseract outputs plain text (TXT). PDF output was added in version 3.03.
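
A minimal Python sketch of how the two output types might be requested from the command line, assuming the ''tesseract'' executable is on the PATH and follows the usual ''tesseract imagename outputbase [-l lang] [configfile]'' usage. The function names and file names are hypothetical, and the ''pdf'' config name is only expected to work in version 3.03 or later.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None, pdf=False):
    """Build the argument list for one Tesseract run.

    Tesseract appends the extension itself, so output_base 'page'
    yields 'page.txt' (or 'page.pdf' when the pdf config is used).
    """
    command = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the language pack, e.g. 'deu' (example value).
        command += ["-l", lang]
    if pdf:
        # Assumes Tesseract 3.03 or later, where PDF output was added.
        command.append("pdf")
    return command


def run_ocr(image_path, output_base, lang=None, pdf=False):
    """Run Tesseract and raise CalledProcessError if it fails."""
    command = build_tesseract_command(image_path, output_base, lang, pdf)
    subprocess.run(command, check=True)
```

Calling ''run_ocr("page.tif", "page")'' would then be expected to write ''page.txt'', while passing ''pdf=True'' requests ''page.pdf'' instead.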

==Documentation and support==
*Smith, Ray (2007). ''An Overview of the Tesseract OCR Engine''. [http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/33418.pdf http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/33418.pdf]
*Installation information can be found on the [https://code.google.com/p/tesseract-ocr/wiki/ReadMe ReadMe] page of the project site.
*Support is offered and issues are addressed on the [https://code.google.com/p/tesseract-ocr/issues/list Issues] page of the project site.

==Usability==
*Tesseract was developed primarily for English OCR, but 47 language packs have been developed for use with other languages (see the [https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html tesseract(1) manual page]). Tesseract 2.0x and 3.0x are [https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 trainable] for other languages.
*There is no built-in GUI, but there are several available from the [https://code.google.com/p/tesseract-ocr/wiki/3rdParty 3rdParty] page.

= User Experiences =
Applied in an AQuA Mashup that resulted in the Solution page: [http://wiki.opf-labs.org/display/AQuA/Compare+OCR+results+of+the+same+source+material+in+different+formats+%28TIFF%2C+JP2%29 Compare OCR results of the same source material in different formats (TIFF, JP2)]

= Development Activity =

{{Infobox_tool_details
|ohloh_id=tesseract-ocr
}}