HTR-United
What is HTR-United?
HTR-United is an open community initiative that provides a centralized catalogue and set of standards for sharing Handwritten Text Recognition (HTR) and OCR training datasets and models. It was founded to address a chronic problem in the field: researchers routinely produce high-quality training data and models for specific projects, then publish them in incompatible formats under unclear licenses, making reuse difficult or impossible.
HTR-United is not a platform but a standard plus a catalogue:
- A metadata schema (HUMU — HTR-United Metadata Updater) for describing datasets and models in a machine-readable way
- A GitHub organization hosting curated repositories of ground-truth data
- A searchable catalogue website at htr-united.github.io
HTR-United is the primary discovery layer for open HTR datasets. If you are looking for training data or ground truth for a specific script, language, or period, the HTR-United catalogue is the first place to search.
The Problem HTR-United Solves
Before HTR-United, researchers publishing HTR data had no shared standard for:
- File formats — PAGE XML, ALTO XML, or plain-text `.gt.txt` files alongside images
- Metadata — what script, language, time period, repository, license, and transcription policy applied
- Versioning — how to track updates and cite a specific version
- Discoverability — data scattered across personal GitHub repos, Zenodo deposits, and project websites
HTR-United provides a common vocabulary and schema that makes datasets findable, comparable, and citable.
The HUMU Metadata Schema
Each dataset in the HTR-United catalogue is described by an `htr-united.yml` file following the HUMU schema. Key fields include:
```yaml
schema: https://htr-united.github.io/schema.json
version: "1.0"
title: "My HTR Dataset"
url: https://github.com/example/my-dataset
authors:
  - name: "Smith, Jane"
    orcid: "0000-0000-0000-0000"
description: "Ground truth for 15th-century Latin charters."
language:
  - iso: la
    label: Latin
script:
  - iso: Latn
    label: Latin
script-type: only-manuscript
time:
  notBefore: "1400"
  notAfter: "1500"
license:
  - name: CC BY 4.0
    url: https://creativecommons.org/licenses/by/4.0/
transcription-guidelines: "Diplomatic; abbreviations not expanded."
volume:
  - metric: lines
    count: 5000
```

This structured metadata enables automatic aggregation across repositories and powers the catalogue’s search interface.
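A metadata record like the one above can be sanity-checked programmatically once parsed. The sketch below works on an already-parsed Python dict (as a YAML loader would produce); the required-field list and the date check are illustrative assumptions, not the official HUMU schema or validator.

```python
# Minimal sanity check for an HTR-United-style metadata record, assuming
# it has already been parsed (e.g. with a YAML loader) into a dict.
# The required-field list below is illustrative, not the official schema.

REQUIRED_FIELDS = ["schema", "title", "url", "authors", "language",
                   "script", "time", "license", "volume"]

def check_metadata(record: dict) -> list[str]:
    """Return a list of problems found in a metadata record."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS
                if f not in record]
    time = record.get("time", {})
    if "notBefore" in time and "notAfter" in time:
        if int(time["notBefore"]) > int(time["notAfter"]):
            problems.append("time: notBefore is later than notAfter")
    return problems

record = {
    "schema": "https://htr-united.github.io/schema.json",
    "title": "My HTR Dataset",
    "url": "https://github.com/example/my-dataset",
    "authors": [{"name": "Smith, Jane"}],
    "language": [{"iso": "la", "label": "Latin"}],
    "script": [{"iso": "Latn", "label": "Latin"}],
    "time": {"notBefore": "1400", "notAfter": "1500"},
    "license": [{"name": "CC BY 4.0"}],
    "volume": [{"metric": "lines", "count": 5000}],
}
print(check_metadata(record))  # → []
```

A real submission should rely on the HUMU validator rather than ad-hoc checks like this; the sketch only illustrates why machine-readable metadata is useful.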
The HTR-United Catalogue
The htr-united.github.io website provides:
- A searchable table of all registered datasets, filterable by language, script, time period, and license
- Statistics on the total volume of ground truth available per language and script
- Direct links to each dataset’s GitHub repository and Zenodo archive
As of 2025, the catalogue contains several hundred datasets covering scripts from Latin and Greek to Hebrew, Arabic, and vernacular European hands.
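The per-language volume statistics the catalogue displays can be derived by summing each record’s `volume` counts, grouped by language. A minimal sketch, with invented sample records standing in for real catalogue entries:

```python
# Aggregate ground-truth volume (in lines) per language across a set of
# HTR-United-style metadata records. The records below are invented.
from collections import defaultdict

records = [
    {"language": [{"iso": "la"}], "volume": [{"metric": "lines", "count": 5000}]},
    {"language": [{"iso": "la"}], "volume": [{"metric": "lines", "count": 2500}]},
    {"language": [{"iso": "he"}], "volume": [{"metric": "lines", "count": 800}]},
]

lines_per_language = defaultdict(int)
for rec in records:
    n = sum(v["count"] for v in rec["volume"] if v["metric"] == "lines")
    for lang in rec["language"]:
        lines_per_language[lang["iso"]] += n

print(dict(lines_per_language))  # → {'la': 7500, 'he': 800}
```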
Ground Truth Formats
HTR-United datasets follow one of two standard formats:
Kraken / eScriptorium format
Line images stored as cropped PNGs paired with ground truth text files:
```
line_001.png
line_001.gt.txt
line_002.png
line_002.gt.txt
```
This format is directly usable with `ketos train` (Kraken’s training command).
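Before training on a dataset in this layout, it is worth checking that every line image has a matching transcription and vice versa. A small sketch, assuming a flat directory of `.png`/`.gt.txt` pairs as listed above:

```python
# Find line images without a .gt.txt transcription (and vice versa)
# in a flat Kraken-style dataset directory.
import tempfile
from pathlib import Path

def find_orphans(dataset_dir: str) -> tuple[set[str], set[str]]:
    d = Path(dataset_dir)
    images = {p.stem for p in d.glob("*.png")}
    texts = {p.name[:-len(".gt.txt")] for p in d.glob("*.gt.txt")}
    return images - texts, texts - images

# Demonstration with a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    for name in ["line_001.png", "line_001.gt.txt", "line_002.png"]:
        Path(tmp, name).touch()
    no_text, no_image = find_orphans(tmp)
    print(no_text, no_image)  # → {'line_002'} set()
```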
PAGE XML / ALTO XML
Full-page images with structural annotations in PAGE or ALTO XML. These can be compiled into Kraken training format using `ketos compile`.
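To see what this format contains, line transcriptions can be pulled out of an ALTO file with the standard library alone. The fragment below is a minimal hand-written ALTO 4 example, not a real dataset file:

```python
# Extract per-line transcriptions from an ALTO XML document using only
# the standard library. ALTO stores text in String/@CONTENT attributes
# nested inside TextLine elements.
import xml.etree.ElementTree as ET

ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v4#}"

alto = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="In nomine domini"/></TextLine>
    <TextLine><String CONTENT="amen"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

root = ET.fromstring(alto)
lines = [
    " ".join(s.get("CONTENT") for s in line.iter(f"{ALTO_NS}String"))
    for line in root.iter(f"{ALTO_NS}TextLine")
]
print(lines)  # → ['In nomine domini', 'amen']
```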
Using HTR-United Data with Kraken
Once you have downloaded a dataset from HTR-United, training a Kraken model is straightforward:
```shell
# Compile ALTO XML ground truth into Kraken binary format
ketos compile -f alto -o train.arrow *.xml

# Train a new model (or fine-tune from a base)
ketos train \
  --load base_model.mlmodel \
  --resize new \
  -o my_model \
  train.arrow
```

The `--resize new` flag is important when fine-tuning: it allows the model to learn new characters not present in the base model’s training data.
Linking Models to Data
One of HTR-United’s key goals is to connect published models to their training data, enabling reproducibility. When a model is published on Zenodo alongside an HTR-United dataset, reviewers and future users can:
- Inspect exactly what data was used
- Reproduce the training run
- Evaluate the model on held-out subsets
The Inzigkofen models and the OICEN-HTR bundle are good examples of this practice done well.
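Evaluating a model on a held-out subset usually comes down to computing the character error rate (CER): edit distance between prediction and ground truth, divided by the ground-truth length. A self-contained sketch using a plain dynamic-programming Levenshtein distance:

```python
# Character error rate (CER) between model output and ground truth,
# the standard metric for evaluating an HTR model on held-out data.

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(predicted: str, ground_truth: str) -> float:
    return levenshtein(predicted, ground_truth) / len(ground_truth)

print(cer("In nomme domini", "In nomine domini"))  # → 0.125
```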
Transcription Policy Documentation
HTR-United encourages — but does not yet enforce — explicit documentation of transcription policies. The CATMuS project publishes its guidelines at catmus-medieval.github.io and the MiDRASH project similarly documents its annotation rules on Zenodo. This matters enormously: two datasets labeled “diplomatic” may encode abbreviations in entirely different ways.
Before fine-tuning with external data: Always check the transcription policy of any HTR-United dataset you plan to combine with your own ground truth. Mixing graphematic and semi-diplomatic data in the same training set typically degrades model quality.
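One quick heuristic for spotting such mismatches (an assumption of mine, not an HTR-United procedure): compare the character inventories of the two ground-truth sets. A codepoint that appears in only one corpus, such as a medieval abbreviation sign or a combining mark, suggests the corpora were transcribed under different rules.

```python
# Compare character inventories of two ground-truth corpora to flag
# possible transcription-policy mismatches before mixing them.
# The sample lines are invented; corpus_a keeps abbreviation signs
# (U+A76F, combining macron U+0304), corpus_b expands them.
import unicodedata

def alphabet(lines: list[str]) -> set[str]:
    return {ch for line in lines for ch in line if not ch.isspace()}

corpus_a = ["domin\ua76f noster", "p\u0304 dict\u016b"]  # signs kept
corpus_b = ["dominus noster", "pre dictum"]              # signs expanded

only_a = alphabet(corpus_a) - alphabet(corpus_b)
for ch in sorted(only_a):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}")
```

Characters reported here deserve a look at each dataset’s transcription guidelines before the sets are combined.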
Contributing to HTR-United
To register your dataset:
- Structure your data as a GitHub repository with line images and `.gt.txt` files (or PAGE/ALTO XML)
- Add an `htr-united.yml` metadata file to the root of the repository
- Open a pull request against the HTR-United catalogue repository
The HUMU validator checks your metadata file for schema compliance before the PR can be merged.
Key Links
| Resource | URL |
|---|---|
| Catalogue website | https://htr-united.github.io/ |
| GitHub organization | https://github.com/HTR-United |
| HUMU schema | https://htr-united.github.io/schema.json |
| HUMU validator | https://github.com/HTR-United/htr-united/tree/main/validator |
| CATMuS guidelines | https://catmus-medieval.github.io/ |