
Text recognition

In the REPUBLIC project, text recognition was used to convert the scanned resolutions into transcriptions. This makes the resolutions easier to read. We also used the generated ‘computer-readable’ text to structure the material and make it searchable, and to subsequently extract information from it.

The transcriptions were made with ATR (Automatic Text Recognition). In the first phase of the project, we used Transkribus for the handwritten material. We trained the ATR software on approximately one thousand randomly selected pages that had been transcribed by hand. The resulting model already performed very well: it recognised 97 out of 100 characters correctly.

By switching to Loghi, ATR software developed within the Humanities Cluster of the KNAW, we increased this score for the handwritten material to 98%. The printed resolutions were initially converted into computer-readable text using the OCR software Tesseract. Loghi was later used for the printed material as well, with 99% of the characters recognised correctly.
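The percentages above are character-level scores. The following is a minimal sketch of how such a score can be computed, assuming it is defined as one minus the character error rate (the edit distance between the automatic and the hand-made transcription, divided by the length of the hand-made transcription). The evaluation procedure actually used in the project is not described here, and the function names `edit_distance` and `character_accuracy` are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of character insertions,
    deletions and substitutions needed to turn hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - (edit distance / reference length)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a score of 0.97 corresponds to roughly three character errors per hundred characters of the hand-made transcription.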

Loghi is freely available, open-source machine learning software that learns to make transcriptions from examples. Loghi’s transcription process consists of three steps (each of which can be divided into smaller steps): 1) detecting where the text lines and text regions are located; 2) digitally cutting out and transcribing these detected text lines; 3) post-processing, such as merging text and region information and calculating where words are located in the text lines.
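The three steps can be pictured with the sketch below. It is not Loghi’s code or API: `TextLine`, `run_pipeline` and `estimate_word_positions` are hypothetical names, the detection and recognition models are passed in as placeholder functions, and the word positions are estimated with a simple assumption (each character takes up an equal share of the line width) that may differ from what Loghi does.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    region_id: str   # text region the line belongs to
    x_start: int     # left edge of the line in pixels
    x_end: int       # right edge of the line in pixels
    text: str = ""   # filled in by the recognition step


def estimate_word_positions(line: TextLine) -> list[tuple[str, int, int]]:
    """Step 3 (simplified): place each word on the line, assuming every
    character occupies an equal share of the line width."""
    total = len(line.text)
    if total == 0:
        return []
    width = line.x_end - line.x_start
    positions = []
    offset = 0
    for word in line.text.split(" "):
        if word:
            start = line.x_start + round(width * offset / total)
            end = line.x_start + round(width * (offset + len(word)) / total)
            positions.append((word, start, end))
        offset += len(word) + 1  # skip the word and the following space
    return positions


def run_pipeline(page_image, detect, recognise):
    """detect and recognise are placeholders for trained models."""
    lines = detect(page_image)                    # step 1: find lines and regions
    for line in lines:
        line.text = recognise(page_image, line)   # step 2: cut out and transcribe
    # step 3: combine text, region and word-position information
    return [(line, estimate_word_positions(line)) for line in lines]
```

For example, with a placeholder detector that returns one line from pixel 0 to 100 and a recogniser that returns "ab cd", the sketch places "ab" at pixels 0 to 40 and "cd" at pixels 60 to 100.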

In addition, many volunteers have corrected the automatic transcriptions of approximately 95,000 pages via the VeleHanden platform. This means that an almost error-free transcription is available for approximately 18% of all resolution pages.
