Entity recognition
As the final step in making the resolutions accessible, they were enriched with annotations that mark the names of places, persons, organisations and other relevant items in the text. Such marked items are usually called named entities. Together, these entities form a kind of digital index to the text of the resolutions. The following types of entities were recognised:
Capacities (functions or roles of persons)
Locations (places, areas, countries)
Committees of the States General
References to other resolutions
The marking of entities took place in two steps. In the first step (recognition), a number of language models were trained on a selection of resolutions that volunteers had annotated manually. These models then marked the text locations throughout the corpus where entities are referred to. Ordering, sorting and assigning these text locations forms the second step (curation).
The main challenge in curating entities is the variation in how they are referred to. This variation has several causes. First, spelling and text recognition errors introduce noise into the entity names themselves. In addition, spelling and language use change over the time span covered by the resolutions.
Because the number of entities is so large (millions), manual curation was not possible. Although the precise method differs between entity types, the process generally consists of two parts: identifying the entities that appear in the resolutions, and assigning text locations to the identified entities. The second part always runs automatically; the lists for the first part are usually compiled manually. Not every text location could be linked to a known entity in this curation step. Text locations that could not be identified were not marked and were therefore omitted from the dataset. These are usually text locations that the language model incorrectly designated as referring to an entity, or references to entities that occur only once or a few times. See here for further explanation of the specific entity types.
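The two-part curation described above can be illustrated with a minimal sketch. It is not the project's actual method, which is not specified here and differs per entity type. It assumes a small, manually compiled list of entities with spelling variants and uses the string similarity from Python's standard difflib module as a stand-in for the automatic assignment. The names CURATED_LOCATIONS, assign_mention and curate, the example variants and the threshold of 0.8 are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical, manually compiled list: entity name -> known spelling variants.
# The entries are illustrative only and are not taken from the dataset.
CURATED_LOCATIONS = {
    "Amsterdam": ["Amsterdam", "Amstelredam", "Amsteldam"],
    "Den Haag": ["Den Haag", "'s-Gravenhage", "den Hage"],
}


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def assign_mention(mention, curated, threshold=0.8):
    """Return the best-matching entity name, or None below the threshold."""
    best_name, best_score = None, 0.0
    for name, variants in curated.items():
        for variant in variants:
            score = similarity(mention, variant)
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def curate(mentions, curated, threshold=0.8):
    """Group recognised mentions by entity; unmatched mentions are dropped."""
    assigned = {}
    for mention in mentions:
        name = assign_mention(mention, curated, threshold)
        if name is not None:
            assigned.setdefault(name, []).append(mention)
    return assigned


if __name__ == "__main__":
    # Mentions as a recognition model might return them (assumed input).
    recognised = ["Amsterdam", "Amsterdm", "Xyzzy"]
    print(curate(recognised, CURATED_LOCATIONS))
```

In this sketch a mention that matches no curated variant closely enough is simply left out of the result, mirroring how unidentified text locations were omitted from the dataset. A lower threshold would link more noisy spellings but also more false matches, so the value chosen here is only a placeholder.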