Toolkit for Generating and Augmenting Hungarian Handwritten Text Recognition Dataset

Authors

  • Mohammed Al-Hitawi, University of Fallujah
  • M A Alazaizi, Faculty of Management and Informatics, Technical University of Munich

DOI:

https://doi.org/10.65204/djes.v3i1.315

Keywords:

Handwritten text recognition, Hungarian HTR dataset, Deep learning, Cultural heritage, Digitization

Abstract

Handwritten Text Recognition (HTR) for low-resource languages such as Hungarian remains difficult because high-quality datasets are scarce. This paper introduces the HuHTR-Toolkit, an open-source framework for generating and augmenting Hungarian HTR datasets. The toolkit supports two approaches to expanding data. The first, synthetic generation, renders realistic handwritten text images from open-source corpora, enabling the creation of large-scale datasets. The second, data augmentation, applies image-processing techniques, including mathematical, chromatic, and style-based transformations, to enhance model robustness. Using the Belval HuHTR-Toolkit, we produced over three million synthetic text-image line pairs and evaluated their impact on transformer-based models. Experimental results indicate that synthetic data with such augmentation greatly improves character- and word-level accuracy, reducing reliance on costly human-annotated datasets. The toolkit and generated datasets are publicly available to support further research in low-resource HTR and cultural heritage digitization.
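
Character- and word-level accuracy in HTR is commonly reported as character error rate (CER) and word error rate (WER), both derived from edit distance. The following is a minimal sketch under that assumption, using the standard Levenshtein definitions. The function names are illustrative and are not taken from the HuHTR-Toolkit, and the abstract does not state which exact metric the authors used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical ground truth and model output differing by one accent.
    truth = "árvíztűrő tükörfúrógép"
    predicted = "arvíztűrő tükörfúrógép"
    print(f"CER = {cer(truth, predicted):.3f}, WER = {wer(truth, predicted):.3f}")
```

Because Hungarian relies on accented letters, both strings should use the same Unicode normalization form before comparison. Otherwise a visually identical character can be counted as an error.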

Published

2026-03-22