How do I configure the output format of tesseract OCR?
# import the tesserocr bindings for the Tesseract API (pip install tesserocr)
import tesserocr
# create the API object with automatic page segmentation
api = tesserocr.PyTessBaseAPI(psm=tesserocr.PSM.AUTO)
# run tesseract with image file
api.SetImageFile('my_image.png')
api.Recognize()
# get output as hOCR for the first page (page index 0)
hocr = api.GetHOCRText(0)
# print output
print(hocr)
# release the API resources
api.End()
The above example code uses the tesserocr Python bindings to get the output of tesseract OCR as hOCR, an open format that embeds OCR results (text, layout, and bounding boxes) in HTML. With the Tesseract API, the output format is chosen by the getter you call after recognition, not by a separate setting: GetHOCRText returns hOCR, while GetUTF8Text returns plain text. The code runs tesseract on the image file my_image.png and prints the output.
The code consists of the following parts:
- import tesserocr: This imports the tesserocr bindings.
- api = tesserocr.PyTessBaseAPI(psm=tesserocr.PSM.AUTO): This creates an instance of the PyTessBaseAPI class with the page segmentation mode set to auto.
- api.SetImageFile('my_image.png'): This sets the image file to the given file.
- api.Recognize(): This runs tesseract on the given image file.
- hocr = api.GetHOCRText(0): This gets the output of the first page as hOCR.
- print(hocr): This prints the output.
- api.End(): This releases the resources held by the API object.
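The output format can also be selected on the tesseract command line, where a config name such as hocr, pdf, tsv or txt is placed after the output base name. The following is a minimal sketch of that approach, not a definitive implementation: build_tesseract_command and run_tesseract are hypothetical helper names introduced here for illustration, and the sketch assumes the tesseract binary is installed and on the PATH.

```python
import subprocess

# Output config names shipped with standard Tesseract installs.
SUPPORTED_FORMATS = ("txt", "hocr", "pdf", "tsv")


def build_tesseract_command(image_path, output_base, output_format="hocr"):
    """Return the argv list for a tesseract CLI run (hypothetical helper)."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    # CLI shape: tesseract <image> <output base> [configfile...]
    return ["tesseract", image_path, output_base, output_format]


def run_tesseract(image_path, output_base, output_format="hocr"):
    """Run tesseract; assumes the binary is on the PATH (hypothetical helper)."""
    command = build_tesseract_command(image_path, output_base, output_format)
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(command, check=True)
```

Calling run_tesseract('my_image.png', 'out', 'hocr') would run tesseract my_image.png out hocr, which writes the result to a file named after the output base instead of returning it as a string.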