2. Introduction: The Problem of Reproducibility in Dynamic Digital Ecosystems

Warning

This is still a draft! Please refer to the PDF version.

angels sing, and a light suddenly fills the room.

—Linus Torvalds, “git/README”, April 8, 2005

commits shape history

—Git Guides, “Git commit”, October 29, 2021

“Reproducibility” has never been a central methodological problem for traditional literary studies. Certainly, an interpretation of a poem (for example) should be comprehensible, its arguments plausible, its evidence empirical. But hardly anyone demands that such an interpretation should be “reproducible”, let alone by a researcher other than the one who originally provided the interpretation.

In fact, the demand for reproducibility only enters the field of literary studies when empirical methods are adopted, for example from sociology or psychology. Thus, with the rise of Computational Literary Studies (CLS), which is likewise committed to an empirical methodology, a new and quite wide field of research has recently opened up in which literary studies is confronted with the problem of reproducibility. Yet “repetitive research” (to use a broader term, following Schöch [2023]) can take very different forms: one might think of the re-implementation of methods and scripts in new research projects; of the re-analysis of data sets with optimized tools; or of the exact re-production of analyses in the course of scientific quality assurance, for example in peer review. In these and many other respects, Computational Literary Studies (but also Computational Humanities and Digital Humanities in general) is facing the demand for reproducibility.

However, according to critical voices, research in the humanities has not adequately met this demand. Alluding to the so-called “replication crisis” [Open Science Collaboration, 2015] in some empirical sciences (particularly in psychology and medicine), James O’Sullivan, for example, stated in 2019 that “the humanities have a ‘reproducibility’ problem” [O’Sullivan, 2019]. In her widely discussed critique of CLS, Nan Z Da pointed out that in several cases it was not possible to reproduce the results of research in this field [Da, 2019]. And in a paper as relevant as it is comprehensive, Christof Schöch recently concluded that when it comes to “reproducibility” there are “serious and relevant challenges for the field of CLS”, “starting with issues of access to data and code, but also concerning questions of lacking reporting standards, limited scholarly recognition, and missing community commitment and capacity that would all be needed to foster a culture of [repetitive research] in CLS and beyond” [Schöch, 2023, p.379].

This report addresses one particular aspect of the far-reaching disciplinary reproducibility challenge, namely the stabilization of living (and programmable) corpora. What do we mean by that? In a nutshell: central to any form of repetitive research is the object to be researched; let us call it the epistemic object. In CLS, this epistemic object is regularly no longer just an individual text or a small group of individual texts, but a “corpus” and thus an entity that, “across many research domains in the humanities and social sciences”, “has emerged as a major genre of cultural and scientific knowledge” [Gavin, 2023, p.4].[1] Now, there is a large number of corpora that can be fully digitized with manageable resources, for example author corpora (such as all of William Shakespeare’s comedies or all of Henrik Ibsen’s plays). On the other hand, there is also a large number of epistemic objects, i.e. CLS corpora, which cannot be made digitally available so easily. For example, what if someone wants to study all the texts of Scandinavian Romanticism; or the drama of the early modern period in southern Europe; or even just the tragedy of the German Enlightenment? In many cases, we do not even know exactly which texts would have to be included in such corpora, let alone whether these texts are already available in digital form. In all these cases, we must assume that the epistemic object of CLS is currently (and presumably for a long time to come) in the making: in the process of becoming, of growing and thus, in a certain sense, “living”. Speaking of “living corpora” against this backdrop emphasizes that the digitization of our cultural and literary heritage is not so much a state that is or could be achieved, but rather a process, a (permanent) mode of transformation that we have entered. One of the consequences is that these epistemic objects of CLS must be conceptualized as dynamic and as yet unstable.[2]

It is precisely this problematic point that our report and the accompanying technical research address. What we have called the problem of the “stabilization of living corpora” can be understood, as the title of our report suggests, as a versioning task: if there is a comprehensive, transparent and fully addressable versioning mechanism for our dynamic epistemic object, then stabilization can be achieved by pointing to a particular version. Furthermore, if corpora are coupled with lightweight tactical research infrastructures in the form of research-driven APIs, as in the case of “Programmable Corpora” (a concept introduced in [Fischer et al., 2019]; see the in-depth explanation in [Börner and Trilcke, 2023]), then containerization can be used as an overarching and integrating versioning mechanism.[3]
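To make the idea of “pointing to a particular version” more tangible, the following minimal Python sketch shows one way such a pointer could look if a corpus is versioned with Git and hosted on GitHub. It is an illustration under stated assumptions, not the implementation used in this report: the repository name `dracor-org/gerdracor`, the file path and the commit hash in the usage note are placeholders, and the names `CorpusVersion`, `commits_endpoint` and `summarize_commit` are hypothetical. The sketch relies only on the documented GitHub REST endpoint for listing commits and on the raw-file URL pattern of GitHub; it builds URLs and parses a response item, but does not itself perform any network request.

```python
from dataclasses import dataclass
from urllib.parse import quote, urlencode

GITHUB_API = "https://api.github.com"
RAW_HOST = "https://raw.githubusercontent.com"


@dataclass(frozen=True)
class CorpusVersion:
    """A stable pointer to one state of a living corpus (hypothetical helper).

    A branch name such as "main" moves with every new commit; a full commit
    hash does not, which is why the hash is used as the version identifier.
    """

    owner: str
    repo: str
    commit_sha: str

    def raw_file_url(self, path: str) -> str:
        """Address a single corpus file exactly as it was at this commit."""
        return f"{RAW_HOST}/{self.owner}/{self.repo}/{self.commit_sha}/{quote(path)}"


def commits_endpoint(owner: str, repo: str, per_page: int = 30) -> str:
    """URL of the GitHub REST endpoint that lists the commits of a repository."""
    query = urlencode({"per_page": per_page})
    return f"{GITHUB_API}/repos/{owner}/{repo}/commits?{query}"


def summarize_commit(item: dict) -> dict:
    """Reduce one item of the commit list to basic technical metadata."""
    return {
        "sha": item["sha"],
        "date": item["commit"]["author"]["date"],
        # Keep only the subject line of a possibly multi-line commit message.
        "message": item["commit"]["message"].split("\n", 1)[0],
    }
```

A call such as `CorpusVersion("dracor-org", "gerdracor", "<full commit hash>").raw_file_url("tei/example-play.xml")` would then yield an address that keeps referring to the same file content even after the corpus has grown, whereas the same path on a branch would silently follow every later change.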

In the subsequent chapters, we will work through this set of problems and our proposed solutions in the following way:

Chapter 3 (“Versioning Living Corpora Using Git Commits”) will, on the one hand, introduce Git (in its actual implementation in the online service GitHub[4]), a powerful tool for distributed version control, as a way of versioning living corpora; on the other hand, using the GitHub API and the example of the GerDraCor corpus, we will illustrate what kind of additional (technical) metadata about living corpora can be retrieved.

Following chapter 3, we provide an excursus (“An Algorithmic Archaeology of a Living Corpus: GerDraCor as a Dynamic Epistemic Object”) in which we illustrate how Git-based versioning and the metadata that is produced in the course of versioning can be used for what we call corpus archaeology: an approach to the epistemic objects of CLS that treats them as technical objects whose genesis itself can be investigated. The excursus is crucial to our argument because it vividly demonstrates what it means for a corpus to be “living”.

Version control using Git is a viable solution for the stabilization of living corpora, as long as they are “just” data. However, it is not yet a sufficient solution for the stabilization of programmable corpora, as these are in fact combinations of data and code, both of which stand in a reciprocal relationship of co-evolution. As a consequence, stabilizing the data on the one hand and the code on the other by versioning each independently may not be sufficient in this case. Therefore, in chapter 4 (“Dockerizing DraCor, for Example. On Versioning Programmable Corpora”), we present an approach that versions an entire programmable corpus as a bundle, using a containerization mechanism that we call “Dockerizing DraCor”.

In chapter 4 in particular, we will again follow a prototyping approach, as we have already done in our report “D7.1 On Programmable Corpora” (cf. our explanation in [Börner and Trilcke, 2023, p.9–11]). This also implies that the technical solutions we propose have mostly already been implemented as prototypes in the development work accompanying this report. We indicate where these technical prototypes can be found at the relevant points in this report. The information in chapter 0 (“About this deliverable”) also provides an overview of the accompanying technical prototypes. Overall, our work revolves around the overarching “DraCor” prototype of a programmable corpora ecosystem: “DraCor” is a multicomponent prototype that includes a number of homogenized corpora and several APIs, some of which are document-based and some of which are research-driven; in addition, the DraCor prototype includes exemplary microservices.