Toolkit for Generating and Augmenting Hungarian Handwritten Text Recognition Dataset
DOI:
https://doi.org/10.65204/djes.v3i1.315
Keywords:
Handwritten text recognition, Hungarian HTR dataset, Deep learning, Cultural heritage, Digitization
Abstract
Handwritten Text Recognition (HTR) for low-resource languages such as Hungarian remains challenging because high-quality datasets are scarce. This paper introduces the HuHTR-Toolkit, an open-source framework for generating and augmenting Hungarian HTR datasets. The toolkit generates realistic handwritten text images from open-source corpora, enabling the creation of large-scale datasets. As a second approach to expanding the data, it applies image-processing techniques for data augmentation, offering a wide range of mathematical, chromatic, and style-based transformations to enhance model robustness. Using the Belval HuHTR-Toolkit, we produced over three million synthetic text-image line pairs and evaluated their impact on transformer-based models. Experimental results indicate that synthetic data with touch augmentation greatly improves character- and word-level accuracy, reducing reliance on costly human-annotated datasets. The toolkit and generated datasets are publicly available to support further research in low-resource HTR and cultural heritage digitization.
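To make the two augmentation families named in the abstract more concrete, the following is a minimal sketch of one geometric ("mathematical") and one chromatic transformation applied to a text-line image. It is an illustration under stated assumptions and not the HuHTR-Toolkit's actual API: the function names `shear_rows`, `adjust_intensity`, and `augment`, the parameter ranges, and the representation of an image as a nested list of 8-bit grayscale values are all hypothetical. Style-based transformations are not covered.

```python
import random


def shear_rows(img, max_shift, background=255):
    """Geometric augmentation (hypothetical helper): shift each row sideways
    to imitate slanted handwriting. The top row moves by max_shift pixels,
    the bottom row stays in place, and vacated pixels get the background."""
    height, width = len(img), len(img[0])
    out = []
    for y, row in enumerate(img):
        shift = round(max_shift * (height - 1 - y) / max(height - 1, 1))
        shift = max(-width, min(width, shift))  # never shift past the image
        if shift >= 0:
            out.append([background] * shift + row[:width - shift])
        else:
            out.append(row[-shift:] + [background] * (-shift))
    return out


def adjust_intensity(img, gain, bias):
    """Chromatic augmentation (hypothetical helper): linear contrast and
    brightness change, clamped to the 8-bit range 0..255."""
    return [[max(0, min(255, round(gain * p + bias))) for p in row]
            for row in img]


def augment(img, rng):
    """Apply one random slant and one random intensity change.
    The ranges below are assumptions chosen only for illustration."""
    max_shift = rng.randint(-3, 3)
    gain = rng.uniform(0.8, 1.2)
    bias = rng.uniform(-20, 20)
    return adjust_intensity(shear_rows(img, max_shift), gain, bias)


# A tiny 3x4 "image": white background (255) with a vertical ink stroke (0).
stroke = [[255, 0, 255, 255],
          [255, 0, 255, 255],
          [255, 0, 255, 255]]
slanted = shear_rows(stroke, 2)
augmented = augment(stroke, random.Random(0))
```

In this sketch every augmented image keeps the size of its input, so one transcription can be paired with many visual variants of the same line. A practical pipeline would more likely operate on arrays from an imaging library than on nested lists.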