How do I create a traineddata file for Tesseract OCR?

Creating a traineddata file for Tesseract OCR requires a few steps. The steps below follow the legacy (Tesseract 3.x) box-file training workflow:

Generate a font_properties file. Each line names a font and gives five 0/1 flags that describe its style: italic, bold, fixed (monospace), serif, and fraktur.

fontname italic bold fixed serif fraktur

## Example

Roboto 0 0 0 0 0
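As a rough sketch, a font_properties line can be generated and validated with a small shell function. This assumes the Tesseract 3.x layout of one font name followed by five 0/1 flags (italic, bold, fixed, serif, fraktur). The function name `font_properties_line` is hypothetical and not part of Tesseract.

```shell
# Hypothetical helper (not part of Tesseract): print one font_properties line.
# Arguments: font name, then five 0/1 flags: italic bold fixed serif fraktur.
font_properties_line() {
  if [ "$#" -ne 6 ]; then
    echo "usage: font_properties_line NAME ITALIC BOLD FIXED SERIF FRAKTUR" >&2
    return 1
  fi
  name=$1
  shift
  # Reject anything that is not a 0/1 flag.
  for flag in "$@"; do
    case "$flag" in
      0|1) ;;
      *) echo "flags must be 0 or 1, got: $flag" >&2; return 1 ;;
    esac
  done
  printf '%s %s %s %s %s %s\n' "$name" "$@"
}
```

For example, `font_properties_line Roboto 0 0 0 0 0 > font_properties` would write a single entry for a regular, proportional, sans-serif font.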

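Generate a box file. This file lists each character in a training image together with its bounding box. It is generated from the training image, not from the font_properties file, and should be checked and corrected by hand where Tesseract guessed wrong.

tesseract fontname.exp0.tif fontname.exp0 batch.nochop makebox

## Example

tesseract Roboto.exp0.tif Roboto.exp0 batch.nochop makebox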
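Before training on a box file, it can help to check that it is well formed. The sketch below assumes the Tesseract 3.x box layout, where each line is a glyph followed by left, bottom, right, and top pixel coordinates and a page number. The helper name `check_box_file` is hypothetical.

```shell
# Hypothetical helper (not part of Tesseract): report lines in a box file
# that do not look like "<glyph> <left> <bottom> <right> <top> <page>".
# Returns 0 if every line passes, 1 otherwise.
check_box_file() {
  awk '
    NF != 6 {
      printf "line %d: expected 6 fields, got %d\n", NR, NF
      bad = 1
      next
    }
    {
      # Fields 2-6 must be non-negative integers.
      for (i = 2; i <= 6; i++) {
        if ($i !~ /^[0-9]+$/) {
          printf "line %d: field %d is not a number\n", NR, i
          bad = 1
        }
      }
    }
    END { exit bad + 0 }
  ' "$1"
}
```

A call such as `check_box_file Roboto.exp0.box` prints any suspicious lines and returns a non-zero status if it finds one. It checks only the layout, not whether the glyphs or coordinates are correct.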

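Combine the trained files into a traineddata file. This assumes the intermediate training steps (unicharset_extractor, mftraining, cntraining) have already produced component files named with the fontname. prefix, such as fontname.unicharset, fontname.inttemp, fontname.pffmtable, and fontname.normproto. Pass only the prefix, including the trailing dot.

combine_tessdata fontname.

## Example

combine_tessdata Roboto.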
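combine_tessdata builds a traineddata file from component files that share a prefix, so a quick pre-check can catch a missing or misnamed file. The sketch below assumes the usual minimum set for the Tesseract 3.x workflow (unicharset, inttemp, pffmtable, normproto). Your version may need more, such as shapetable. The helper name `missing_components` is hypothetical.

```shell
# Hypothetical helper (not part of Tesseract): list the component files
# that are missing for a given prefix, e.g. "Roboto".
# Returns 0 if all assumed components exist, 1 otherwise.
missing_components() {
  prefix=$1
  status=0
  # Assumed minimum set for the legacy 3.x workflow.
  for ext in unicharset inttemp pffmtable normproto; do
    if [ ! -f "$prefix.$ext" ]; then
      echo "missing: $prefix.$ext"
      status=1
    fi
  done
  return "$status"
}
```

One way to use it is `missing_components Roboto && combine_tessdata Roboto.`, which runs the combine step only when the assumed files are present.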

Test the traineddata file. This step is optional, but it is recommended to ensure that the traineddata file is working properly.

tesseract --tessdata-dir . fontname.exp0.tif fontname.exp0 -l fontname

## Example

tesseract --tessdata-dir . Roboto.exp0.tif Roboto.exp0 -l Roboto

The output is written to fontname.exp0.txt (Roboto.exp0.txt in the example) and should contain the text recognized in the image.
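To turn this check into a pass/fail result, the recognized text can be compared with a hand-typed ground-truth file. The sketch below is one possible approach. It assumes that differences in trailing whitespace, blank lines, form feeds, and carriage returns should be ignored. The names `normalize_text` and `ocr_matches`, and the file `Roboto.exp0.gt.txt`, are hypothetical.

```shell
# Hypothetical helpers (not part of Tesseract).

# Strip form feeds, carriage returns, trailing whitespace, and blank lines.
normalize_text() {
  tr -d '\f\r' < "$1" | sed 's/[[:space:]]*$//' | sed '/^$/d'
}

# Succeed if two text files are identical after normalization.
ocr_matches() {
  [ "$(normalize_text "$1")" = "$(normalize_text "$2")" ]
}
```

For example, `ocr_matches Roboto.exp0.gt.txt Roboto.exp0.txt && echo "OCR output matches"` reports success only when the normalized files are identical. An exact match is a strict test, so a real evaluation might measure character error rate instead.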
