
{{Infobox_tool
|purpose=Open source OCR engine, accepting uncompressed TIFF files as input
|image=Tesseract.png
|homepage=http://code.google.com/p/tesseract-ocr/
|license=Apache 2.0 License EXCEPT the tesseractTrainer.py, which is licensed with GPL
|platforms=Linux, Windows, MacOSX
}}

<!-- Delete the Categories that do not apply -->
[[Category:OCR]]



= Description =
Tesseract is probably the most accurate open source OCR engine available. Combined with the [http://leptonica.com/ Leptonica Image Processing Library] it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google.

== Provider ==
Development of Tesseract is sponsored by Google. Its chief developer is [http://research.google.com/pubs/author4479.html Ray Smith].

==Licensing and cost==
Tesseract is an Open Source OCR engine, available under the [http://www.apache.org/licenses/LICENSE-2.0 Apache 2.0 license]. It can be used directly, or (for programmers) using an [http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.h API].

==History==
Tesseract was initially developed at HP between 1985 and 1994. After a decade of minimal development, HP released it as open source in 2005. Google has sponsored its development since 2006.

==Platform and interoperability==
*The latest downloads for Linux and Windows are available on [https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkpGMGV2eE0&usp=sharing Google Drive]. Older versions of Tesseract and its language packs are available on the discontinued [https://code.google.com/p/tesseract-ocr/downloads/list Google Code download page].
*The easiest way to install Tesseract on Mac OS X is with [https://www.macports.org/ MacPorts]. Once MacPorts is installed, install Tesseract by running ''sudo port install tesseract'', and add any language with ''sudo port install tesseract-<langcode>''. A list of available langcodes can be found on the [https://www.macports.org/ports.php?by=name&substr=tesseract- MacPorts Tesseract page].
*Dependencies for compiling Tesseract include Autotools and [http://www.leptonica.org/ Leptonica]. The Windows version requires installation of [http://msdn.microsoft.com/en-us/vstudio/aa718325.aspx Visual Studio]. More information about the required Ubuntu libraries, with links to specific requirements, is on the [https://code.google.com/p/tesseract-ocr/wiki/Compiling Tesseract Wiki].
*Other programs such as Scan Tailor, unpaper, ImageJ, Gimp or ImageMagick may be needed to properly prepare images for use in Tesseract.
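
After installation, a script can confirm that the Tesseract binary is reachable before any OCR work is queued. The following is a minimal sketch, assuming the executable is named <code>tesseract</code> and is on the PATH; the helper names <code>find_tesseract</code> and <code>tesseract_version</code> are hypothetical and not part of Tesseract.

```python
import shutil
import subprocess


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the binary if it is on PATH, otherwise None."""
    return shutil.which(binary_name)


def tesseract_version(binary_path):
    """Return the version banner printed by `tesseract -v`."""
    result = subprocess.run(
        [binary_path, "-v"], capture_output=True, text=True, check=False
    )
    # Some releases write the banner to stderr, so both streams are combined.
    return (result.stdout + result.stderr).strip()
```

A caller might run <code>find_tesseract()</code> first and only call <code>tesseract_version()</code> when the result is not <code>None</code>.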

==Functional notes==
===Input supported===
Any image format readable by Leptonica is supported in Tesseract, including BMP, PNM, PNG, JFIF, JPEG, and TIFF.
GIF is not supported, according to the [http://www.leptonica.com/library-overview.html Leptonica library overview].
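
The format list above can be turned into a simple pre-flight check that filters a batch of files before they are sent to OCR. This is an illustrative sketch only: the mapping from format names to file extensions is an assumption, the helper name <code>is_supported_input</code> is hypothetical, and a file extension does not guarantee that the content is actually readable.

```python
from pathlib import Path

# Extensions assumed for the formats listed above (BMP, PNM, PNG, JFIF,
# JPEG, TIFF). The PNM family is taken to include PBM, PGM and PPM.
SUPPORTED_EXTENSIONS = {
    ".bmp",
    ".pnm", ".pbm", ".pgm", ".ppm",
    ".png",
    ".jfif", ".jpg", ".jpeg",
    ".tif", ".tiff",
}


def is_supported_input(path):
    """Return True if the file extension matches one of the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Files that fail the check, such as GIF images, could first be converted with one of the image preparation tools mentioned under Platform and interoperability.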
===Output generated===
Tesseract outputs to TXT. PDF output was added in version 3.03.
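
A minimal sketch of driving the command-line tool from a script is shown below. It assumes the usual invocation form <code>tesseract imagename outputbase -l lang [configfile]</code>, in which Tesseract appends the file extension to the output base itself and the <code>pdf</code> config file requests PDF output (version 3.03 or later). The function names are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", pdf=False):
    """Build the argument list for one Tesseract run without executing it."""
    command = ["tesseract", str(image_path), str(output_base), "-l", lang]
    if pdf:
        # The "pdf" config file switches the output from TXT to PDF.
        command.append("pdf")
    return command


def run_ocr(image_path, output_base, lang="eng", pdf=False):
    """Run Tesseract and raise CalledProcessError if it exits with an error."""
    command = build_tesseract_command(image_path, output_base, lang, pdf)
    subprocess.run(command, check=True)
```

For example, <code>run_ocr("page.tif", "page")</code> would be expected to write <code>page.txt</code>, and adding <code>pdf=True</code> would be expected to write <code>page.pdf</code> instead.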

==Documentation and support==
*Smith, Ray. (2007). An Overview of the Tesseract OCR Engine [http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/33418.pdf http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/33418.pdf]
*Installation information is found on the [https://code.google.com/p/tesseract-ocr/wiki/ReadMe ReadMe] page of the project site.
*Support is offered and issues are addressed on the [https://code.google.com/p/tesseract-ocr/issues/list Issues] page of the project site.

==Included in==
*Integration with the free [http://www.dcc.ac.uk/resources/external/xena-software-0 Xena-Digital Preservation Software] [http://sourceforge.net/projects/xena/?source=navbar http://sourceforge.net/projects/xena/?source=navbar]
*Integration with Free Online OCR [http://www.free-ocr.com/faq.html http://www.free-ocr.com/faq.html]

==Usability==
*Tesseract was primarily developed for English OCR capability, but 47 language packs have been developed for use with other languages [https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html]. Tesseract 2.0x and 3.0x are [https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 trainable] for other languages.
*There is no built-in GUI, but there are several available from the [https://code.google.com/p/tesseract-ocr/wiki/3rdParty 3rdParty] page.
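
To see which language packs are available on a given machine, a script can list the trained data files. The sketch below assumes the Tesseract 3.0x layout, in which each language is stored as a single <code><langcode>.traineddata</code> file inside a <code>tessdata</code> directory; the directory location varies by platform and must be supplied by the caller. The function name is hypothetical.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes found in a tessdata directory."""
    data_files = Path(tessdata_dir).glob("*.traineddata")
    # "eng.traineddata" yields "eng", "chi_sim.traineddata" yields "chi_sim".
    return sorted(data_file.stem for data_file in data_files)
```

Each returned code can be passed to the <code>-l</code> option of the command-line tool.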

= User Experiences =
*Lazorchak, Butch. (2014). Making Scanned Content Accessible Using Full-text Search and OCR [http://blogs.loc.gov/digitalpreservation/2014/08/making-scanned-content-accessible-using-full-text-search-and-ocr/ http://blogs.loc.gov/digitalpreservation/2014/08/making-scanned-content-accessible-using-full-text-search-and-ocr/]
*Texas A&M University. (2012-). Early Modern OCR Project Workflow [http://emop.tamu.edu/about http://emop.tamu.edu/about]
*Adams, Chris. (2014). Content Search on a Budget: using Tesseract on large TIFF files [http://chris.improbable.org/2014/3/17/content-search-on-a-budget/ http://chris.improbable.org/2014/3/17/content-search-on-a-budget/]
*PSNC Digital Libraries Team. (2011). Tesseract 3.0 installation on Ubuntu 10.10 server [http://dl.psnc.pl/2011/01/24/tesseract-3-0-installation-on-ubuntu-10-10-server/ http://dl.psnc.pl/2011/01/24/tesseract-3-0-installation-on-ubuntu-10-10-server/]
*Lacy, David. (2014). Digital Library upgrade provides enhanced discovery [http://blog.library.villanova.edu/digitallibrary/2014/02/18/digital-library-upgrade-provides-enhanced-discovery/#sthash.OwCtLlEc.dpuf http://blog.library.villanova.edu/digitallibrary/2014/02/18/digital-library-upgrade-provides-enhanced-discovery/#sthash.OwCtLlEc.dpuf]
*Applied in an AQuA Mashup that resulted in the Solution page: [http://wiki.opf-labs.org/display/AQuA/Compare+OCR+results+of+the+same+source+material+in+different+formats+%28TIFF%2C+JP2%29 Compare OCR results of the same source material in different formats (TIFF, JP2)]


= Development Activity =
==Google Code Source Feed==
Below are the last 5 source updates:
<rss max=5>https://code.google.com/feeds/p/tesseract-ocr/gitchanges/basic</rss>
==Google Code Wiki Feed==
Below are the last 3 wiki updates:
<rss max=3>https://code.google.com/feeds/p/tesseract-ocr/gitchanges/basic?repo=wiki</rss>
==Google Code Issue Feed==
Below are the last 3 issue updates:
<rss max=3>https://code.google.com/feeds/p/tesseract-ocr/issueupdates/basic</rss>


{{Infobox_tool_details
|ohloh_id=tesseract-ocr
}}