Extracting process information from archival records

Submitted by Isto Huvila on Fri, 05/20/2022 - 10:20

Presentation together with Ekta Vats, Zanna Friberg, Lisa Börjesson, Jessica Kaiser and Olle Sköld at the final conference of the international network Digitization and the Future of Archives: Digital archives, Big Data and Memory in Copenhagen.


Apart from the lack of information on what archival records are about (described using metadata), there is an increasing awareness that the lack of understanding of the contexts and processes of how records were created and how they have been manipulated (i.e. data about creation, curation and use processes, or paradata) poses a significant hindrance to their effective management, preservation, findability and use. However, the records themselves typically contain a lot of information that qualifies as paradata. The problem is that it is dispersed throughout the material and can be difficult to find and use. Moreover, paradata can appear in text, images (incl. photographs and drawings) and tabular data in the records. This presentation reports findings from a pilot project that investigates how AI-based text and image analysis techniques can be used for mining paradata from archival records pertaining to archaeological excavations. The talk describes how the developed approach shows promise in extracting meaningful information on how records and their contents have been created and processed. Further, the presentation outlines key lessons learned during the development and implementation of the analysis workflow. The heterogeneity of records, and especially that of the expressions of paradata, causes problems for computational analysis, but considering that the same heterogeneity also slows down manual processing of the data, the approach discussed in the project emerges as successful.

The reported work is a part of the research project CApturing Paradata for documenTing data creation and Use for the REsearch of the future (CAPTURE), which has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 818210), and of InterPARES Trust AI, funded by a Canadian SSHRC grant. The work has also received funding from the Centre for Digital Humanities Uppsala (CDHU) pilot project scheme.