Data collection and documentation
The data acquisition process is essential to the project. Upstream tasks included identifying relevant institutions, searching for objects corresponding to the project's scope, and establishing partnerships with them. This document provides an overview of the EyCon team's processes for gathering the data presented on the platform.
The identification of the materials that form the project's initial database was based on several criteria. Various media formats were included: photographic albums, loose photographic collections and books, as well as newspapers and magazines. The formats of the photographs themselves are likewise diverse: they may be silver prints, citrate gelatin aristotypes, stereoscopic views, or photographs published in newspapers via engraving or photomechanical processes.
The chosen temporal boundaries begin in 1880 and close with the end of the First World War: all the materials included in the corpus fall between these two benchmarks. Geographical considerations were also part of the identification of the sources. The project attempts to de-Europeanise the representation of the conflicts as much as possible, with the aim of discovering collections that are seldom displayed or disseminated and therefore little known to the research community and to the public. The subject of the documents must be conflict or war, particularly in a colonial context, but photographs representing everyday life during conflict are also of interest and are included in the database.
The actual acquisition of photographs continues throughout the project. Partnerships with various institutions have allowed us to directly obtain the files and their associated metadata according to pre-established agreements between institutions and the team.
For libraries that make digital content available online, the data can be retrieved through their search and retrieval APIs. To do so, it is necessary to define the search terms that correspond to the materials sought. These are not necessarily characterised in the same way, or with the same vocabulary, from one institution to another. The names of conflicts are sometimes elided, as are the names of the troops involved, which complicates searching by term. It was therefore necessary to browse the digital libraries manually to identify the search markers that allowed us to automatically retrieve materials corresponding to our criteria (subject, dates, places).
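By way of illustration, the sketch below queries an SRU (Search/Retrieve via URL) endpoint, here the BnF's Gallica service, for a hypothetical search term; the query vocabulary, record limits and result handling are placeholders that have to be adapted to each institution's catalogue.

```python
import requests
import xml.etree.ElementTree as ET

# Illustrative SRU request against the Gallica endpoint (BnF).
# The query term and record limits are placeholders: each institution
# describes its collections with its own vocabulary, so the relevant
# search markers have to be identified manually beforehand.
SRU_ENDPOINT = "https://gallica.bnf.fr/SRU"

params = {
    "operation": "searchRetrieve",
    "version": "1.2",
    "query": 'gallica all "guerre du Transvaal"',  # hypothetical search term
    "maximumRecords": 50,
    "startRecord": 1,
}

response = requests.get(SRU_ENDPOINT, params=params, timeout=30)
response.raise_for_status()

# SRU responses are XML; each <record> carries the institution's own
# metadata, which is stored alongside the harvested files.
root = ET.fromstring(response.content)
ns = {"srw": "http://www.loc.gov/zing/srw/"}
for record in root.findall(".//srw:record", ns):
    print(ET.tostring(record, encoding="unicode")[:200])
```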
The third type of acquisition involves the digitisation of documents held by institutions or purchased as part of the project. Making these unpublished documents available seemed essential. Interns working for the EyCon project have carried out a significant amount of digitisation and have created rich and detailed metadata. A digitisation workflow that makes the materials usable by the project and the wider community, combined with the automatic creation of digital texts, has further enriched the data produced.
Several institutions allow the project to use and freely distribute the digital documents acquired within its framework for all scholarly purposes. This is the case for La Contemporaine, the Service Historique de la Défense, the Bibliothèque Nationale de France, the Archives Nationales, the Archives Nationales d'Outre-Mer and the National Library of Scotland. The documents provided by the Imperial War Museum may be used, with due regard to the ethical considerations surrounding the photographs, for a period of ten years after the end of the project; in addition, the published images are watermarked by the institution. This is also the case for the images transmitted by ECPAD, which can only be published in low definition.
In order to create a homogeneous database that is as rich and detailed as possible, all the metadata collected are standardised in the EAD format. This format was chosen because it allows greater freedom in adding information and makes it possible to identify the source of each piece of metadata attached to an image. EyCon preserves the structures and data produced by the participating institutions, to reflect the way each institution records, stores and presents its own data. A more general Dublin Core version will then be created so that the data can be redistributed on a wider scale. The metadata collected came in a variety of formats: some were in XML and some in Excel, while for other documents no metadata had been produced at all. Standardisation to XML-EAD also makes the data interoperable and allows it to be linked to JSON, the format in which the results of the computational processing of the documents will be stored, which will facilitate their analysis.
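As a simplified illustration of this normalisation step, the sketch below maps one spreadsheet-style row of metadata onto a minimal EAD component while recording which institution the record came from; the field names, element choices and sample values are assumptions made for the example, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

def row_to_ead(row: dict, source_institution: str) -> ET.Element:
    """Turn one spreadsheet row of metadata into a minimal EAD <c> component.

    The mapping is deliberately simplified: the real schema keeps the
    structures used by each partner institution, and every value is tagged
    with its provenance so the origin of each piece of metadata stays visible.
    """
    component = ET.Element("c", attrib={"level": "item"})
    did = ET.SubElement(component, "did")

    unitid = ET.SubElement(did, "unitid")
    unitid.text = row.get("identifier", "")

    unittitle = ET.SubElement(did, "unittitle")
    unittitle.text = row.get("title", "")

    unitdate = ET.SubElement(did, "unitdate")
    unitdate.text = row.get("date", "")

    repository = ET.SubElement(did, "repository")
    repository.text = source_institution  # provenance of the record itself

    return component

# Hypothetical Excel-derived row used only for illustration.
row = {"identifier": "pho_0001", "title": "Infantry column on the march", "date": "1900"}
print(ET.tostring(row_to_ead(row, "La Contemporaine"), encoding="unicode"))
```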
Metadata is also increasingly created automatically, through the OCRisation of documents but above all thanks to computer vision tools. The analysis of the layout of newspapers, magazines and albums in particular will make it possible to enrich the captions associated with the extracted images. Linking images together will also allow the pre-existing metadata to be completed automatically, while preserving the history of each piece of information. Additionally, object detection will enrich the description of the photographs' content, and thus expand the database's search possibilities.
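The sketch below shows, under simplified assumptions, how automatically produced descriptors (for instance object-detection labels) could be layered onto a record without overwriting existing values, each entry keeping its source and date so that the history of every piece of information is preserved; the field names and the detection output are hypothetical.

```python
from datetime import date

def add_metadata(record: dict, field: str, value: str, source: str) -> None:
    """Append a value to a metadata field, keeping earlier entries and their provenance."""
    record.setdefault(field, []).append(
        {"value": value, "source": source, "added": date.today().isoformat()}
    )

# Hypothetical record and hypothetical detection output.
record = {
    "caption": [
        {"value": "Artillery position", "source": "original album", "added": "2022-03-01"}
    ]
}

detected_labels = ["cannon", "soldier"]  # would come from an object-detection model
for label in detected_labels:
    add_metadata(record, "depicted_objects", label, source="object detection")

# Existing captions are kept untouched; new descriptors are layered on top.
print(record)
```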
It is this rich metadata that allows the search engine (Solr) to harvest the information and expand the possibilities for exploring the corpus. The descriptors used also allow for the creation of an ontology specific to the project, built on existing ontologies (primarily ICONCLASS, despite its tension with the project's social concerns, since such ontologies were conceived within a Western European frame of reference). Each document's essential information is retained: the creator/photographer, the date of creation, the provenance (the institution of origin), a description of the document, the OCR of any textual information on the document, the type of document, its dimensions, the captions created, the production data, and the different layers of each document.
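A hedged sketch of how such a record might be pushed to the Solr index through its JSON update endpoint is given below; the core name, URL and field values are illustrative assumptions rather than the project's actual configuration.

```python
import json
import requests

# Hypothetical Solr core for the project; the field names mirror the
# essential information listed above (creator, date, provenance, ...).
SOLR_UPDATE_URL = "http://localhost:8983/solr/eycon/update?commit=true"

doc = {
    "id": "al_0042_p003_ph01",
    "creator": "Unknown photographer",
    "date": "1905",
    "provenance": "Service Historique de la Défense",
    "description": "Colonial infantry unit at rest",
    "document_type": "photographic album",
}

# Solr accepts a JSON array of documents on its update handler.
response = requests.post(
    SOLR_UPDATE_URL,
    data=json.dumps([doc]),
    headers={"Content-Type": "application/json"},
    timeout=30,
)
response.raise_for_status()
```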
The database has been designed around the layers of the documents collected. In order to preserve as much of the physical document's original information as possible, the data model reflects the specificity of the formats included, distinguishing between collections, their subdivisions (photographic albums, issues of periodicals, books), the pages of each document and the photographs extracted from each page. To allow easy recognition of documents in the vast catalogue created for the project, each document is tagged with a unique identifier used at every stage of data storage, from the physical storage of the files to its registration in the metadata. The naming rules are based on the type of document (al: album, np: newspaper, book: book, pho: photograph), followed by the identifier assigned to each document (defined in a repository), the page number, the number of the extracted photograph, and finally the file extension (.JPG for images, .XML and .JSON for the corresponding data files). Thanks to these rules, the file names/IDs are standardised so that no conflicts can arise during computation and metadata creation.
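A minimal sketch of how such file names could be assembled and checked is given below; the zero-padding widths and underscore separators are assumptions made for the illustration.

```python
import re

# Document-type prefixes described above.
TYPE_PREFIXES = {"album": "al", "newspaper": "np", "book": "book", "photo": "pho"}

def build_filename(doc_type: str, doc_id: int, page: int, photo: int, ext: str = "JPG") -> str:
    """Build a standardised file name: type prefix, document id, page, extracted photo, extension.

    The padding widths and underscore separator are illustrative assumptions.
    """
    prefix = TYPE_PREFIXES[doc_type]
    return f"{prefix}_{doc_id:04d}_p{page:03d}_ph{photo:02d}.{ext}"

# The same stem is reused for the image and its corresponding data files.
image_file = build_filename("album", 42, 3, 1)          # al_0042_p003_ph01.JPG
xml_file = build_filename("album", 42, 3, 1, "XML")      # al_0042_p003_ph01.XML
json_file = build_filename("album", 42, 3, 1, "JSON")    # al_0042_p003_ph01.JSON

# A simple check that a name follows the convention, to avoid conflicts downstream.
PATTERN = re.compile(r"^(al|np|book|pho)_\d{4}_p\d{3}_ph\d{2}\.(JPG|XML|JSON)$")
assert PATTERN.match(image_file)
```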
The database is stored on a 'Huma-Num Box' dedicated to the project and is made available through Omeka S. For the unique materials created within the framework of the project (e.g. scans), a repository on Nakala will ensure the data's durability. More generally, the acquisition, modification and normalisation of large-scale data is carried out with Python tools created specifically for the project.
The database and its dissemination aim to meet the FAIR data principles of the digital humanities: findable, accessible, interoperable and reusable. The aim is to give greater visibility to unknown and little-exploited materials. Within this framework, the transformation of the documents into IIIF format is planned for the final publication of the project, in the summer of 2023.
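As an indication of what this step involves, the sketch below assembles a minimal IIIF Presentation 3.0 manifest for a single digitised page; the base URL, identifiers and dimensions are placeholder assumptions.

```python
import json

# Minimal IIIF Presentation API 3.0 manifest for one digitised page.
# The host, identifiers and image dimensions are placeholders.
BASE = "https://example.org/iiif"  # hypothetical host

manifest = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": f"{BASE}/al_0042/manifest",
    "type": "Manifest",
    "label": {"en": ["Photographic album al_0042"]},
    "items": [
        {
            "id": f"{BASE}/al_0042/canvas/p003",
            "type": "Canvas",
            "height": 3000,
            "width": 2000,
            "items": [
                {
                    "id": f"{BASE}/al_0042/page/p003/1",
                    "type": "AnnotationPage",
                    "items": [
                        {
                            "id": f"{BASE}/al_0042/annotation/p003-image",
                            "type": "Annotation",
                            "motivation": "painting",
                            "body": {
                                "id": f"{BASE}/images/al_0042_p003.JPG",
                                "type": "Image",
                                "format": "image/jpeg",
                            },
                            "target": f"{BASE}/al_0042/canvas/p003",
                        }
                    ],
                }
            ],
        }
    ],
}

print(json.dumps(manifest, indent=2))
```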