How do I configure the output format of Tesseract OCR?

# import the tesserocr bindings for the Tesseract API
from tesserocr import PyTessBaseAPI, PSM

# create the API with automatic page segmentation
with PyTessBaseAPI(psm=PSM.AUTO) as api:
    # run tesseract with image file
    api.SetImageFile('my_image.png')
    api.Recognize()

    # get output as hOCR (use api.GetUTF8Text() for plain text)
    hocr = api.GetHOCRText(0)

# print output
print(hocr)

The above example code uses the tesserocr Python bindings to get the output of Tesseract OCR as hOCR, an open standard that embeds recognized text and layout data such as bounding boxes in HTML. The output format is chosen by the getter called after recognition: GetUTF8Text() returns plain text, while GetHOCRText() returns hOCR. The code runs Tesseract on the image file my_image.png and prints the output.

The code consists of the following parts:

from tesserocr import PyTessBaseAPI, PSM: This imports the API class and the page segmentation mode enum from the tesserocr library.
with PyTessBaseAPI(psm=PSM.AUTO) as api: This creates an instance of the PyTessBaseAPI class with the page segmentation mode set to auto, and releases it when the block ends.
api.SetImageFile('my_image.png'): This sets the image file to the given file.
api.Recognize(): This runs tesseract on the given image file.
hocr = api.GetHOCRText(0): This gets the output for page 0 as an hOCR string.
print(hocr): This prints the output.
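
The output format can also be selected on the tesseract command line, where a config name such as hocr, pdf, tsv or txt is passed after the output base name. The following is a minimal sketch of that approach, assuming the tesseract binary is on PATH and those configs are installed; the helper names build_tesseract_command and run_tesseract are hypothetical, not part of any library.

```python
import subprocess

# Config names that select an output format on the tesseract command line.
# Assumption: these configs are available in the local Tesseract install.
SUPPORTED_FORMATS = ("txt", "hocr", "pdf", "tsv")


def build_tesseract_command(image_path, output_base, output_format="hocr"):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE CONFIG."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return ["tesseract", image_path, output_base, output_format]


def run_tesseract(image_path, output_base, output_format="hocr"):
    """Run tesseract; raises CalledProcessError if the command fails."""
    command = build_tesseract_command(image_path, output_base, output_format)
    subprocess.run(command, check=True)
```

Calling run_tesseract('my_image.png', 'result', 'hocr') would ask tesseract to write the hOCR output to a file named after the output base result. Keeping the command construction in its own function makes it possible to inspect the arguments without running OCR.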
