```{warning}
This is still a draft! Please refer to the PDF version.
```

# 4. Dockerizing DraCor, for Example. On Versioning Programmable Corpora

```{epigraph}
I make no claims to elegance in programming, 
but I am confident that the scripts work, 
at least as of today.

-- Andrew Piper, Foreword to *Enumerations*
```


(section-4-1)=
## 4.1 Containerizing a Research Environment

As already noted in the introduction (see {ref}`section-2`), a version control system like Git can be seen as both a powerful and user-friendly mechanism to stabilize “living” corpora. However, “programmable” corpora connect data (the “living corpus”) with lightweight research software (in the form of a research-driven API), and this connection can be conceptualized as a distributed research infrastructure in a dynamic digital ecosystem (cf. {cite:t}`boerner_2023_report`). Such a research infrastructure can no longer be stabilized via Git's versioning mechanism alone: in addition to the two components (data, code), their specific connection is also crucial for reproducing the research conducted with it. For such scenarios, as will be shown below, a container-based approach is suitable; we rely on containerization using the [Docker](https://www.docker.com) software.

Our approach is inspired by the concept of research artifacts from computer science, which is described by Arvan et al. as “self-contained packages that contain everything needed to reproduce results reported in a paper” and which are also “typically self-executable, meaning that they are packaged within a virtual machine \[…\] or within a container” {cite:p}`arvan_2022_reproducibility`. The prototypical implementation, on which we will report in the following, is again based on the Drama Corpora Platform, DraCor. In a DraCor-based experiment, we set ourselves the challenge of making a network-analytic study, which we conducted for a paper publication, as fully replicable as possible using Docker.

The containerization technology “Docker” is widely used in the IT industry because it can speed up development cycles and reduce overhead when deploying applications. Especially in the field of “DevOps” – “Dev” for development and “Ops” for operations, which refers to the processes necessary to run an application on a server – communication between the development and the operations team is considered highly important. Docker workflows were introduced to support this. They not only streamline communication processes but also shift the responsibility for handling software dependencies to the development team: Docker enables the people actually writing the application to specify the environment in which their software should run. Such workflows have a few key components: The so-called “Dockerfile”[^dockerfile] is a meaningful and executable form of documentation. It contains the steps necessary to build a highly portable, self-contained digital artifact, a “Docker image”[^dockerimage]. These images can be easily deployed on a designated infrastructure as “Docker containers”[^dockercontainer].

[^dockerfile]: In the [glossary](https://docs.docker.com/glossary) available as part of the [Docker documentation](https://docs.docker.com) the Dockerfile is explained as such: “A Dockerfile is a text document that contains all the commands you would normally execute manually in order to build a Docker image. Docker can build images automatically by reading the instructions from a Dockerfile.” [https://docs.docker.com/glossary/#dockerfile](https://docs.docker.com/glossary/#dockerfile); also see the [documentation of the Dockerfile](https://docs.docker.com/reference/dockerfile).

[^dockerimage]: “Docker images are the basis of containers. An image is an ordered collection of root filesystem changes and the corresponding execution parameters for use within a container runtime. An image typically contains a union of layered filesystems stacked on top of each other. An image does not have state and it never changes.” [https://docs.docker.com/glossary/#image](https://docs.docker.com/glossary/#image) 

[^dockercontainer]: see [https://docs.docker.com/glossary/#container](https://docs.docker.com/glossary/#container).

Certainly, in CLS research projects we will rarely find teams of development and operations professionals who need to communicate better. Still, we would like to argue that the attempt to reproduce research can be framed in a similar way: On the one hand, we have an individual researcher (or a team of researchers) who conducts a study. In our analogy, these are the developers. Their product – a study – relies on some application or script that operates on data. On the other hand, we have researchers who want to reproduce or verify the results, similar to operations professionals who have to deploy someone else's application on a server.

It becomes evident that some hurdles in reproducing CLS research exist due to a lack of clear communication on how to run the analysis scripts and a tendency to offload the responsibility for setting up an environment in which the analysis can be executed to the reproducing party. A containerized research environment might circumvent these problems. Instead of claiming that scripts “work, at least as of today” on the machine of the developing researcher, as, for example, Andrew Piper writes in the foreword to his book “Enumerations” {cite:p}`piper_2018_enumerations {xii}`, container technology could guarantee that a container created from an image containing a runnable, self-contained research environment can be re-run. This would allow for a reproduction of the study. Instead of stating that the analysis was runnable on their machine when the paper was written, and publishing only the code, a researcher could additionally provide a runnable research artifact alongside the study. When using Docker, this could be one or more images that allow the research environment in which the study was conducted to be re-created.[^open_container_initiative]

[^open_container_initiative]: On a side note, because someone might argue that we are thereby embracing a proprietary technology: “Linux containers” – what Docker is at its core – have been available as a technology since 2008. “Docker” was only introduced in 2013, and the company Docker Inc. actively promoted the “Open Container Initiative” (OCI, see [https://opencontainers.org](https://opencontainers.org)), which was started in 2015. It developed a vendor-agnostic specification of containers and images, released in 2017, to which the current Docker implementation adheres. This, of course, still doesn’t guarantee the longevity of the technology, but at least a total lock-in into a certain technology is avoided.

In the following, we will report on our exemplary experiment in making a CLS study replicable by using Docker technology.[^graz_presentation]

[^graz_presentation]: This experiment has also been presented at the “DH2023” conference in Graz. Cf. {cite:p}`boerner_2023_dockerizing`. See also the [slides of the presentation](https://zenodo.org/records/8183676). For a comprehensive evaluation of the use of container technology in research and digital publishing see {cite:p}`burton_2020_digits`.

(section-4-2)=
## 4.2 Case Study: Dockerizing a Complete CLS Study

We exemplify the benefits of a Docker-based research workflow by referring to our study “Detecting Small Worlds in a Corpus of Thousands of Theater Plays” {cite:p}`trilcke_2024_small-worlds`. In this study, we tested different operationalizations of the so-called “Small World” concept based on a multilingual “Very Big Drama Corpus” (VeBiDraCor) of almost 3,000 theater plays. As explained above, the corpora available on DraCor are “living corpora”, which means that both the number of text files contained and the information contained in the text files change (e.g. with regard to metadata or mark-up). This poses an additional challenge for reproducing our study. Furthermore, our analysis script (written in R) retrieves metadata and network metrics from the REST API of the “programmable corpus”. Thus, we had to devise a way of stabilizing not only the corpus but also the API.

DraCor provides Docker images for its services, which are, at their core, the API, a frontend, a metrics service that calculates the network metrics of the plays' co-presence networks, and a triple store. The ready-made Docker images can be used to set up a local DraCor environment.[^dracor_docker_images]

[^dracor_docker_images]: Docker images of the DraCor system components are published on the platform [DockerHub](https://hub.docker.com/u/dracor). More recent images are `dracor/api` ([Repository](https://hub.docker.com/r/dracor/api/tags)), `dracor/frontend` ([Repository](https://hub.docker.com/r/dracor/frontend/tags)), `dracor/metrics` ([Repository](https://hub.docker.com/r/dracor/metrics/tags)) and `dracor/fuseki` ([Repository](https://hub.docker.com/r/dracor/fuseki/tags)). These images are used in the production infrastructure and ideally should be used when setting up local instances as well. For reasons of allowing for replication of work that has, for example, been presented at the DH 2023 Graz conference, the image repositories with the naming convention that repeats “dracor” in the image name, e.g. `dracor/dracor-api` ([Repository](https://hub.docker.com/r/dracor/dracor-api/tags)) are still kept on the platform.

For VeBiDraCor we devised a workflow that spins up a Docker container from a versioned bare Docker image of the DraCor database[^vebidracor_image] and ingests the data of the plays downloaded (“pulled”) from specified GitHub commits using a Python script[^vebidracor_data_ingested]. We then committed this container with `docker commit`[^commit_vebidracor_container] to create a ready-to-use [Docker image of the populated database and API](https://hub.docker.com/layers/ingoboerner/vebidracor-api/3.0.0/images/sha256-0b39f125534f7583bbb18b58379eb98312040eb04dd74548fb563d916a172d4c?context=explore) (see {numref}`frontend_loaded_vebidracor`).

[^vebidracor_image]: The initial local infrastructure is defined in this [Docker Compose file](https://github.com/dracor-org/vebidracor/blob/3c3495d6b9434913687348435a341f781413304d/docker-compose.empty.yml). The Docker image of the eXist-DB containing the API as an eXist-DB application is pulled from this [repository on DockerHub](https://hub.docker.com/layers/ingoboerner/dracor-api/v0.86.3_local/images/sha256-99978fd573262968de665421a9ed2e7deee2d1199be15761013f54ead834635e?context=explore).

[^vebidracor_data_ingested]: See the section "Define plays to load" in the [Jupyter Notebook](https://github.com/dracor-org/vebidracor/blob/3c3495d6b9434913687348435a341f781413304d/vebidracor-workflow.ipynb) that was used to populate the database.

[^commit_vebidracor_container]: The necessary steps are contained in the [VeBiDraCor Workflow notebook](https://github.com/dracor-org/vebidracor/blob/3c3495d6b9434913687348435a341f781413304d/vebidracor-workflow.ipynb) in the section "Prepare container image". See also the [documentation](https://docs.docker.com/reference/cli/docker/container/commit) of the `docker commit` command.
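
The committing step can be sketched in Python as follows. This is only an illustration of how a `docker commit` call is assembled, not the code from the VeBiDraCor notebook; the container name `vebidracor-api-container` and the commit message are assumptions made for this example, while the image reference is the one published for VeBiDraCor.

```python
from typing import List, Optional


def build_commit_command(container: str, image: str,
                         message: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `docker commit CONTAINER REPOSITORY[:TAG]`."""
    command = ["docker", "commit"]
    if message:
        # --message stores a commit message in the image history
        command += ["--message", message]
    command += [container, image]
    return command


# Hypothetical container name and message, used for illustration only
command = build_commit_command(
    container="vebidracor-api-container",
    image="ingoboerner/vebidracor-api:3.0.0",
    message="Populated VeBiDraCor database",
)
print(" ".join(command))

# To actually create the image (requires a running Docker daemon):
# import subprocess
# subprocess.run(command, check=True)
```

Building the argument list separately from executing it keeps the step inspectable: the exact command can be logged alongside the study before anything is run.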

% Figure is rendered in the HTML output here

```{figure} ./images/frontend_loaded_vebidracor.png
---
width: 600px
name: frontend_loaded_vebidracor
---
Frontend of the local DraCor infrastructure with added corpus VeBiDraCor
```

Because the build process is modular and documented in a Dockerfile, it is also possible to quickly change the API’s base image or the composition of the corpus by editing a manifest file that controls which plays from which repositories at which state are included.

In a second step, we also dockerized the research environment: a Docker container running [RStudio](https://posit.co/products/open-source/rstudio) to which we added our [analysis script](https://github.com/dracor-org/small-world-paper/blob/ddf85c6a5c5d32004439520e6d5984cb40d7bad3/smallworlds-script.R). The preparation of this image is documented in a [Dockerfile](https://github.com/dracor-org/small-world-paper/blob/5622479f8649be81e3fa50bf8cc9ce5d3d44da5d/Dockerfile). As base image we used an image of the [rocker-project](https://rocker-project.org/). We used `docker commit`[^docker_commit_docu] to “freeze” this state of our system and published all images. We call this state the “pre-analysis state”, which is documented in a [Docker Compose file](https://github.com/dracor-org/small-world-paper/blob/bef7eadef788775a8b0a8e8351ff9d5249cbb65f/docker-compose.pre.yml).

[^docker_commit_docu]: [https://docs.docker.com/reference/cli/docker/container/commit](https://docs.docker.com/reference/cli/docker/container/commit)

After we ran the analysis, we again created an image of the RStudio container with `docker commit`, thus turning it into a Docker image that basically “froze” the state of the research environment after the R-script was run. The [image of this “post-analysis state”](https://hub.docker.com/layers/ingoboerner/smallworld-rstudio/dcac262/images/sha256-03bec767bdc213a002783e2d0b34d896dce308ddfc243b90ef13b9292a972c54?context=explore) was also published on the DockerHub repository. It allows for inspection and verification of the results of our study in the same environment that we used (see {numref}`rstudio_post-analysis-state`).[^smallworld_code_repo]

[^smallworld_code_repo]: The workflow described above is also documented in the `README.md` file in the [repository accompanying the “Small-World” study](https://github.com/dracor-org/small-world-paper/tree/publication-version). The original VeBiDraCor was built with [this Jupyter notebook](https://github.com/dracor-org/vebidracor/blob/3c3495d6b9434913687348435a341f781413304d/vebidracor-workflow.ipynb).

% Figure is rendered in the HTML output here

```{figure} ./images/rstudio_post-analysis-state.png
---
width: 600px
name: rstudio_post-analysis-state
---
RStudio after the analysis was run
```

By having these images representing two moments in the course of our analysis, we not only make our analysis transparent, we also allow for different scenarios of repeating our research. For example, starting an environment with the Docker Compose file that documents the “pre-analysis state” would allow a researcher to exactly repeat our analysis by re-running the script on the exact same data.

Other scenarios of repeating research (e.g. "replication", "reproduction", "revision", "reanalysis", "reinvestigation"; cf. {cite:p}`schoech_2023_repetitive-research`) could also be implemented easily. To give one example: A researcher could adapt the Jupyter Notebook we used to assemble “VeBiDraCor” and create an image of the local corpus container the same way we did. By changing a single line in the Docker Compose file documenting the “pre-analysis state”, it is possible to start the whole system with this different data. To run our R-script on this data, he or she could still use a container created from our RStudio image and thus run the analysis the exact same way we did, but on different data.
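
The “single line” change can be illustrated with a small Python sketch. The Compose snippet below is a shortened, hypothetical stand-in (the service names and the `example/...` images are assumptions, not the contents of `docker-compose.pre.yml`); the helper `swap_service_image` is likewise only an illustration.

```python
import re

# Hypothetical, shortened Compose file; service names and the
# "example/..." images are illustrative assumptions.
COMPOSE_PRE = """\
services:
  api:
    image: ingoboerner/vebidracor-api:3.0.0
  rstudio:
    image: example/rstudio:pre-analysis
"""


def swap_service_image(compose_text: str, service: str, new_image: str) -> str:
    """Replace the image of one service in a simply structured Compose file."""
    pattern = re.compile(
        rf"(^  {re.escape(service)}:\n(?:    .*\n)*?    image: )\S+",
        flags=re.MULTILINE,
    )
    # A function is used as replacement so that the new image name is
    # inserted literally, without backslash or group-reference handling.
    updated, count = pattern.subn(lambda m: m.group(1) + new_image, compose_text)
    if count != 1:
        raise ValueError(f"service {service!r} with an image line not found")
    return updated


# Point the "api" service to a different (hypothetical) corpus image
print(swap_service_image(COMPOSE_PRE, "api", "example/my-corpus-api:1.0.0"))
```

For real Compose files with less regular indentation, a YAML parser would be the more robust choice; the regular expression only works for the flat layout assumed here.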

The following code cell demonstrates how to re-create the infrastructure that was used to generate the results reported in the paper.[^run_docker_commands_in_notebook]

[^run_docker_commands_in_notebook]: Running the following cell will work if this "Executable Report" is run with the "Docker in Docker" setup as described in the `README.md` file. The [cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics) `%%script bash --bg` runs the commands in a shell in the background directly from this Jupyter notebook. Change the first line to `%%bash` if the output should be shown here; note that this will halt the execution of all following cells. It will take some time until the DraCor frontend of the local infrastructure becomes available at [http://localhost:8088](http://localhost:8088). To check the status, use the command `docker ps -a` and look for the `STATUS` of the container based on the image `ingoboerner/vebidracor-api:3.0.0`. In the "Docker in Docker" setup there is still an unresolved issue with accessing the RStudio container at [http://localhost:8787](http://localhost:8787).

In [None]:
%%script bash --bg

# Clone the GitHub repository containing the data of the study

git clone https://github.com/dracor-org/small-world-paper.git

# Go into the just downloaded repository and switch to the branch "publication-version"

cd small-world-paper
git checkout publication-version

# Start the infrastructure in the "post-analysis-state" as defined 
# in the Compose file "docker-compose.post.yml"

docker compose -f docker-compose.post.yml up 

In [None]:
# The containers take some time to start. When all cells are executed in one go,
# the clean-up cell below (not included in the rendering of the report) fails
# if it runs before any Docker containers are up.
# We therefore wait about 60 seconds before executing the next cell.
# This also ensures that the next cell produces meaningful output.

import time

time.sleep(60)

In [None]:
%%bash

# To see the status of the containers
# the STATUS of the container derived from the image ingoboerner/vebidracor-api:3.0.0 should be "Up"

docker ps -a
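
As an alternative to reading the `STATUS` column by eye, the check could be scripted. The following sketch parses the output of `docker ps -a --format '{{json .}}'`, which prints one JSON object per container; the helper name `container_is_up` and the sample output are made up for this illustration.

```python
import json


def container_is_up(ps_output: str, image: str) -> bool:
    """Return True if a container created from `image` has a status starting with "Up".

    `ps_output` is expected to be the output of
    `docker ps -a --format '{{json .}}'` (one JSON object per line).
    """
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("Image") == image and entry.get("Status", "").startswith("Up"):
            return True
    return False


# Made-up sample output, for illustration only
sample = "\n".join([
    json.dumps({"Image": "ingoboerner/vebidracor-api:3.0.0",
                "Status": "Up 2 minutes", "Names": "api"}),
    json.dumps({"Image": "example/other:latest",
                "Status": "Exited (0) 5 seconds ago", "Names": "other"}),
])
print(container_is_up(sample, "ingoboerner/vebidracor-api:3.0.0"))  # prints True
```

Called in a loop with a short pause, such a check could replace the fixed 60-second wait used above.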

In [None]:
%%bash

# Stop and remove all Docker containers to avoid conflicts 
# especially regarding ports in the next section
# This cell does not show up in the final rendering of the report
# If you want to use the Docker containers above and play around with them, 
# the following commands should NOT be run

docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)

In [None]:
# should also remove the large VeBiDraCor image
# This cell does not show up in the final HTML rendering

!docker rmi ingoboerner/vebidracor-api:3.0.0

In [None]:
%%bash 

# Remove the cloned small-world-paper repository folder
# This cell does not show up in the final rendering of the report

rm -rf small-world-paper 

(section-4-3)=
## 4.3 Simplifying the Workflow: StableDraCor

The workflow presented in the section {ref}`section-4-2` above is still quite complex. Setting up the locally running infrastructure involves multiple steps, some of which need to be run from the command line. In addition, one has to create a bare container with a database and an API, populate it with the data, conduct the analysis, create several Docker images, and, ultimately, publish them in a repository, e.g. on DockerHub. A Docker Compose file also has to be created that specifies which system components are needed to recreate the environment in which the analysis was run at different points in time. Thus, there is a considerable need to make the process more user-friendly.


Our approach to simplifying the process focuses on developing a Python package called “StableDraCor” that makes the setting-up of local DraCor instances and populating them with data easier by somewhat “hiding” the complexity of the Docker and Docker Compose commands. While there is no real need for a generic tool managing containers and images (because this can be done with [Docker Desktop](https://www.docker.com/products/docker-desktop)), with “StableDraCor” we address the complexity of setting up the specific DraCor infrastructure components and loading DraCor corpora (or a subset thereof).

In the following, we describe the workflow using the tool. The package can be built and installed from the [GitHub repository](https://github.com/dracor-org/stabledracor)[^stabledracor_dev_version]. 

[^stabledracor_dev_version]: Currently, it is still recommended to use the version in development from the “dind” branch of [https://github.com/ingoboerner/stable-dracor/tree/dind](https://github.com/ingoboerner/stable-dracor/tree/dind) which implements a “Docker in Docker” setup. For a tutorial on how to use this setup see the [introduction notebook](https://github.com/ingoboerner/stable-dracor/blob/dind/notebooks/02_intro.ipynb).

In [None]:
%%bash

# Clone the repository
# The target directory is outside of the current working directory "report"

git clone https://github.com/ingoboerner/stable-dracor.git /home/d73/stable-dracor

# Switch into repository and checkout a commit 
# We checkout the commit because the package is still in development and we want to install a
# certain version to be used here
# the -c option is optional; it suppresses a warning about git being in so-called detached head state

cd /home/d73/stable-dracor
git -c advice.detachedHead=false checkout df31c4e

# install the package

pip install .

# cleanup: remove the previously cloned repository
cd /home/d73/report
rm -rf /home/d73/stable-dracor

After initializing a StableDraCor instance, the infrastructure can be started with a single command `run()`. If a user does not specify any parameters (like pointing to a designated custom local Docker Compose file) the script fetches a configuration of the system specified by a [Docker Compose file](https://github.com/dracor-org/stabledracor/blob/2dc461e6f3d8106f5291ba0b1f6779b7adb52c5d/configurations/compose.fullstack.empty.yml) (`compose.fullstack.yml`) and starts the defined containers. 

In [None]:
# Import the package
from stabledracor.client import StableDraCor

# Use a GitHub Personal Access Token, see 
# https://github.com/ingoboerner/stable-dracor/blob/e300d77c419537538b4d491a8bbe2b9449123131/notebooks/03_faq.ipynb

import os
github_token = os.environ.get("GITHUB_TOKEN")

# Initialize a local DraCor infrastructure
# Provide metadata like a "name" and a description

local_dracor = StableDraCor(
    name="my_local_dracor", 
    description="My local demo DraCor system",
    github_access_token=github_token
)

# Start the infrastructure

local_dracor.run()

After running the above cell in the "Executable Report", the local DraCor frontend becomes available at [http://localhost:8088](http://localhost:8088). As can be seen in {ref}`local_dracor_frontend_empty`, no corpora have been loaded yet.

% Figure is rendered in the HTML output here

```{figure} ./images/local_dracor_frontend_empty.png
---
width: 600px
name: local_dracor_frontend_empty
---
Frontend of the local DraCor infrastructure with no corpora loaded
```

The package supports setting up local custom corpora by copying a corpus or parts thereof from any running DraCor system, for example the [production system](https://dracor.org) or the [staging server](https://staging.dracor.org), which contains additional corpora that are currently being prepared for publication.

In the following code cell the "Tatar Drama Corpus" is copied from the production instance of DraCor to the local database (see {ref}`local_dracor_frontend_empty`).

In [None]:
local_dracor.copy_corpus(source_corpusname="tat")

After executing the command in the code cell above there will be a single corpus in the [local DraCor instance](http://localhost:8088) (see {ref}`local_dracor_tatdracor`) that is a copy of the data currently available at [https://dracor.org/tat](https://dracor.org/tat).

% Figure is rendered in the HTML output here

```{figure} ./images/local_dracor_tatdracor.png
---
width: 600px
name: local_dracor_tatdracor
---
Frontend of the local DraCor infrastructure with the "Tatar Drama Corpus" loaded
```

It is also possible to directly add TEI files from the local filesystem, which allows a user to even use the DraCor environment with data not published on [dracor.org](https://dracor.org) or a public GitHub repository. When adding data to a local Docker container with the help of the “StableDraCor” package, the program keeps track of the constitution of the corpora and the sources used.

In [None]:
%%bash

# Remove the file if downloaded in previous (incomplete) runs of the notebook
# otherwise the wget command in the next cell would fail
# This cell is not included in the final HTML rendering of the report

rm -f ../import/lessing-emilia-galotti.xml

In [None]:
%%bash

# Download a single file to the import folder to demonstrate the import of a local file

wget https://raw.githubusercontent.com/dracor-org/gerdracor/a99060f0065856f8df114ce8556c31161c0332d1/tei/lessing-emilia-galotti.xml -P ../import

In the following code cell a single file is imported into the custom local corpus "FilesDraCor" (see {ref}`local_dracor_filesdracor`).

In [None]:
# Create a corpus "FilesDraCor" and add a single play from the folder "import" to it

local_dracor.add_plays_from_directory(
    corpusname="files",
    directory="../import/"
)

In [None]:
!ls ../import

% Figure is rendered in the HTML output here

```{figure} ./images/local_dracor_filesdracor.png
---
width: 600px
name: local_dracor_filesdracor
---
Frontend of the local DraCor infrastructure with the additionally loaded file in the corpus "FilesDraCor"
```

In [None]:
%%bash 

# Remove the Demo-File
# This cell is not included in the rendering of the final report

rm ../import/lessing-emilia-galotti.xml

To allow for better reproducibility of the local infrastructure, it is recommended to use the functionality to load corpora or parts thereof directly from a GitHub repository. This method of adding data makes it possible to specify the “version” of the data in the corpus compilation process at a given point in time by referring to a single GitHub commit. As mentioned above, because DraCor corpora are “living corpora”, it is not guaranteed that corpora available on the web platform do not change. Therefore, it would not be a good idea to base research that aims to be repeatable on the data in the live system. By using data directly from GitHub with StableDraCor, it is possible to include only the plays that were available, let’s say, two years ago, in the encoding state they were in at that time.
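
What “referring to a single GitHub commit” means for an individual file can be shown with a short sketch. It builds a commit-pinned raw file URL of the same form as the one used with `wget` earlier in this section; the helper name `raw_file_url` is an assumption for this example and not part of StableDraCor.

```python
def raw_file_url(owner: str, repository: str, commit: str, path: str) -> str:
    """Build the raw.githubusercontent.com URL of a file at a given commit."""
    return f"https://raw.githubusercontent.com/{owner}/{repository}/{commit}/{path}"


# The play and commit used in the local-file import example above
url = raw_file_url(
    owner="dracor-org",
    repository="gerdracor",
    commit="a99060f0065856f8df114ce8556c31161c0332d1",
    path="tei/lessing-emilia-galotti.xml",
)
print(url)
```

Because the commit hash is part of the URL, the address keeps pointing to the same encoding state of the play even when the repository's default branch moves on.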

In the following code cell the "Bashkir Drama Corpus" is added to the local database directly from its GitHub Repository.

In [None]:
# Add the Bashkir Drama Corpus from GitHub in the version identified by a single commit

local_dracor.add_corpus_from_repo(
    repository_name="bashdracor", 
    commit="c16b58ef3726a63c431bb9575b682c165c9c0cbd")

The local DraCor infrastructure that has been set up until this point can be explored at [http://localhost:8088](http://localhost:8088) (see {ref}`local_dracor_3_corpora`).

% Figure is rendered in the HTML output here

```{figure} ./images/local_dracor_3_corpora.png
---
width: 600px
name: local_dracor_3_corpora
---
Frontend of the local DraCor infrastructure with three corpora including the "Bashkir Drama Corpus"
```

The tool keeps track of the whole configuration of the system: This includes the versions of the microservices used and the corpora loaded. The state of the corpus is identified by a timestamp and, if the source is a GitHub repository, by the commit. This “log” can be output as a “manifest” JSON object (command: `get_manifest()`), which should allow re-creating the system even if no Docker image is available. It would also allow a user to unambiguously identify the exact data that was used in a study.[^manifest_as_single_source_of_truth]

[^manifest_as_single_source_of_truth]: If someone did not want to or, for some reason, could not use Docker, the manifest alone would still be a sufficient source to retrieve the files used if they come from a corpus on GitHub. Of course, if local files are used, this does not help, but at least it makes this circumstance transparent.

The following code snippet shows such a manifest[^manifest_explained] with several corpora added:

[^manifest_explained]: For an explanation of the manifest see the [introduction notebook](https://github.com/ingoboerner/stable-dracor/blob/df31c4e6b42d0e8c6ba294efe4d26aa473719ab2/notebooks/02_intro.ipynb) to the tool. 

In [None]:
# Output the manifest documenting the local DraCor System

local_dracor.get_manifest()

“StableDraCor” supports creating a Docker image from a populated database container. With the original workflow, this had to be done with the `docker commit` command in the terminal. It was also necessary to provide additional documentation, for example the Jupyter notebook that was used to assemble a corpus. This information was ‘detached’ from the Docker image and had to be pointed to explicitly, because it was not part of the research artifact itself. We tackled this issue with StableDraCor and found a way to include machine-readable documentation *about* the research artifact directly *attached* to it. Now, when we create an image with the tool, we issue a slightly different `docker commit` command that also attaches Docker Object Labels[^docker_object_labels] directly to the newly created image. We achieve this by taking the manifest mentioned earlier and decomposing it into single Docker labels. StableDraCor can convert a manifest into labels and also convert Docker Object Labels on the image in the `org.dracor.stable-dracor.*` namespace back into a manifest. By providing the manifest information as image labels, we allow a user, for example, to retrieve information about the corpus contents and the sources of a database without having to run the image as a Docker container first. We also attach the information about the individual DraCor microservices directly to the image as labels.

[^docker_object_labels]: See [https://docs.docker.com/config/labels-custom-metadata](https://docs.docker.com/config/labels-custom-metadata) for the documentation of Docker Object Labels.
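
The idea of decomposing a manifest into labels and converting it back can be sketched as follows. This is not StableDraCor's implementation: the manifest structure, the key layout below the `org.dracor.stable-dracor.` prefix, and both function names are assumptions made for this illustration.

```python
PREFIX = "org.dracor.stable-dracor."


def manifest_to_labels(manifest: dict, prefix: str = PREFIX) -> dict:
    """Flatten a nested manifest into dotted label keys with string values."""
    labels = {}

    def walk(node: dict, path: str) -> None:
        for key, value in node.items():
            dotted = f"{path}{key}"
            if isinstance(value, dict):
                walk(value, dotted + ".")
            else:
                # Docker label values are strings
                labels[dotted] = str(value)

    walk(manifest, prefix)
    return labels


def labels_to_manifest(labels: dict, prefix: str = PREFIX) -> dict:
    """Rebuild a nested manifest from the labels in the given namespace."""
    manifest = {}
    for key, value in labels.items():
        if not key.startswith(prefix):
            continue  # ignore labels from other namespaces
        parts = key[len(prefix):].split(".")
        node = manifest
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return manifest


# Hypothetical manifest structure, for illustration only
manifest = {
    "name": "my_local_dracor",
    "corpora": {
        "bash": {
            "source": "github",
            "commit": "c16b58ef3726a63c431bb9575b682c165c9c0cbd",
        }
    },
}
labels = manifest_to_labels(manifest)
print(labels)
print(labels_to_manifest(labels) == manifest)  # prints True
```

The round trip in this sketch is only lossless under its own simplifications: keys must not contain dots, and all leaf values come back as strings.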

In the following code cell a Docker image of the DraCor API container is created.

In [None]:
# Create a Docker image of the DraCor API container

local_dracor.create_docker_image_of_service(service="api", 
                                            image_tag="d73_demo")

In [None]:
%%bash

# Check if the new Stable DraCor API image has been created
# This cell is not included in the final HTML rendering of the report

docker images | grep stable-dracor

The labels attached to this image are as follows:

In [None]:
import subprocess, json

operation = subprocess.run(["docker", "inspect", "--type=image", "--format", "{{json . }}", "dracor/stable-dracor:d73_demo"], capture_output=True)
result = json.loads(operation.stdout.decode("utf-8"))
result["Config"]["Labels"]
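
Since an image can carry labels from several sources (for example from its base image), it may be useful to keep only those in the StableDraCor namespace. The sketch below works on the default output of `docker inspect --type=image IMAGE`, which is a JSON array, unlike the single object returned with the `--format` option in the cell above. The function name and the sample labels are made up for this example.

```python
import json


def labels_in_namespace(inspect_output: str, namespace: str) -> dict:
    """Return the labels of the first inspected image whose keys start with `namespace`."""
    entries = json.loads(inspect_output)
    # "Labels" can be null if an image has no labels at all
    labels = entries[0].get("Config", {}).get("Labels") or {}
    return {key: value for key, value in labels.items() if key.startswith(namespace)}


# Made-up inspect output with one label inside and one outside the namespace
sample_inspect = json.dumps([
    {
        "Config": {
            "Labels": {
                "org.dracor.stable-dracor.name": "my_local_dracor",
                "maintainer": "someone@example.org",
            }
        }
    }
])
print(labels_in_namespace(sample_inspect, "org.dracor.stable-dracor."))
```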

In summary, our “StableDraCor” package makes it possible to generate a fully self-describing, completely versioned research artifact that alone is sufficient to replicate the corpora and the research infrastructure used in a study.

In [None]:
%%bash

# Stop and remove all Docker containers to avoid conflicts 
# especially regarding ports in the next section
# This cell does not show up in the final rendering of the report
# If you want to use the Docker containers above and play around with them, 
# the following commands should NOT be run

docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)

(section-4-4)=
## 4.4 Practical Examples

The following section demonstrates how the concepts and tools introduced in this report can be used to facilitate future repeating of research based on DraCor infrastructure components and corpora. The first example shows how to identify and then stabilize the exact version of the corpus used in a research paper that was discussed earlier in section {ref}`section-3-1`. In the second example a version of the "DLINA Corpus Sydney" is created on the basis of recent DraCor data to allow for repeating research carried out in the context of the DLINA project on the same set of plays.[^dlina_research]

[^dlina_research]: For research carried out using the DLINA corpus see [DLINA Blog](https://dlina.github.io) and the [DraCor Research Page](https://dracor.org/doc/research). For an example of a re-implementation of an algorithm developed in DLINA to be used with the DraCor API see the Jupyter Notebook [To catch a protagonist in DraCor](https://github.com/dracor-org/dracor-notebooks/blob/2452d3f23de449783f4c964708cf9e88c65e35af/catch-a-protagonist-in-dracor/catch-a-protagonist-in-dracor.ipynb).

In [None]:
# This is needed to re-use outputs of code in the markdown cells. 
# This cell is removed in the rendered report

from myst_nb import glue

(section-4-4-1)=
### 4.4.1 Reconstructing (and Stabilizing) Corpora Used to Train and Evaluate a Classifier for Chiasmus Detection 

In {ref}`section-3-1` the paper “Data-Driven Detection of General Chiasmi Using Lexical and Semantic Features” {cite:p}`schneider_2021_chiasmi` is discussed as an example of research that re-uses the German Drama Corpus. The authors do not use the DraCor API for their study but download data directly from the GerDraCor GitHub repository. The only information that hints at which version of the corpus was used to train and test the classifier is the number of plays that were included in the corpus at the time. The authors report that GerDraCor included 504 plays.

In the following code cells we use the "corpus archeology script" described in the section {ref}`excursus` to identify the actual version of the German Drama Corpus used and load this version into a local DraCor infrastructure as described in section {ref}`section-4-3`:  

In [None]:
# Setup (This partly repeats code that has been executed elsewhere; make sure everything is available)
# This cell is not included in the final HTML rendering of the report

# The methods needed for the following analysis are bundled as the class "GitHubRepo",
# which we import with the following line

from github_utils import GitHubRepo

# Import the package
from stabledracor.client import StableDraCor

# Use a GitHub Personal Access Token, see 
# https://github.com/ingoboerner/stable-dracor/blob/e300d77c419537538b4d491a8bbe2b9449123131/notebooks/03_faq.ipynb

import os
github_token = os.environ.get("GITHUB_TOKEN")

In [None]:
# Start the analysis with previously downloaded data

repo = GitHubRepo(repository_name="gerdracor", 
                  github_access_token=github_token,
                  import_commit_list="tmp/gerdracor_commits.json",
                  import_commit_details="tmp/gerdracor_commits_detailed.json",
                  import_data_folder_objects="tmp/gerdracor_data_folder_objects.json",
                  import_corpus_versions="tmp/gerdracor_corpus_versions.json")

# To get the version with 504 plays:
# Get the versions as a dataframe containing the number of plays included ("document_count")

play_counts_df = repo.get_corpus_versions_as_df(columns=["id","date_from","document_count"])

# Filter the dataframe on versions that have exactly 504 plays
play_counts_df[play_counts_df["document_count"] == 504]

In [None]:
# This cell is not included in the final rendering

glue("chiasmus_corpus_version_id", 
         play_counts_df[play_counts_df["document_count"] == 504].iloc[0]["id"])

chiasmus_corpus_version_date_object = play_counts_df[play_counts_df["document_count"] == 504].iloc[0]["date_from"].date()
glue("chiasmus_corpus_version_date_formatted", chiasmus_corpus_version_date_object.strftime("%-d %B, %Y"))

The most probable version of the GerDraCor data used is identified by the SHA value {glue:text}`chiasmus_corpus_version_id` and dates from {glue:text}`chiasmus_corpus_version_date_formatted`.[^sighum5] In the following cell a Docker container of the DraCor API is created and populated with this exact corpus version (see {ref}`local_dracor_504_gerdracor_plays`):

[^sighum5]: This date is also plausible because it lies well before the date on which the research paper was presented at the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, which took place in November 2021, both online and in person in Punta Cana, Dominican Republic. According to the [conference website](https://sighum.wordpress.com/events/latech-clfl-2021/important-dates), the paper submission deadline for this event was in August 2021.
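
The selection logic behind this identification can be sketched independently of the `GitHubRepo` class. The following is a minimal illustration only: the version records, their IDs, dates and counts are made up, and the date cut-off (end of August 2021) is an assumption that turns the plausibility argument of the footnote into an explicit filter. The notebook itself simply takes the first version with 504 plays.

```python
from datetime import date

# Hypothetical stand-in for the corpus-version records; IDs, dates and counts are invented
versions = [
    {"id": "aaa111", "date_from": date(2020, 11, 2), "document_count": 503},
    {"id": "bbb222", "date_from": date(2020, 12, 14), "document_count": 504},
    {"id": "ccc333", "date_from": date(2021, 1, 20), "document_count": 504},
    {"id": "ddd444", "date_from": date(2021, 9, 1), "document_count": 504},
    {"id": "eee555", "date_from": date(2021, 10, 5), "document_count": 510},
]


def candidate_versions(versions, document_count, not_after):
    """Return versions with the given play count dated on or before `not_after`, oldest first."""
    matches = [
        v for v in versions
        if v["document_count"] == document_count and v["date_from"] <= not_after
    ]
    return sorted(matches, key=lambda v: v["date_from"])


# Assumed cut-off: the source only states that the submission deadline was in August 2021
candidates = candidate_versions(versions, 504, date(2021, 8, 31))
candidate_ids = [v["id"] for v in candidates]
print(candidate_ids)
```

If more than one candidate remains, the play count alone does not pin down a single commit, and the choice between them has to be documented as a judgment call.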

In [None]:
# the next cell should maybe not execute when building the report. See also issues below: 
# https://discourse.jupyter.org/t/is-there-a-cell-tag-convention-to-skip-execution/5445
# https://github.com/executablebooks/jupyter-book/issues/833

In [None]:
# Initialize the StableDraCor instance; add metadata

chiasmus_detection_dracor = StableDraCor(
    name="chiasmus_detection", 
    description="DraCor system including the GerDraCor version used in the paper 'Data-Driven Detection of General Chiasmi'",
    github_access_token=github_token
)

# Run the infrastructure

chiasmus_detection_dracor.run()

# Add the corpus version
# Because data is ingested (which is a slow process) this will take some time (approx. 30min!)
# Be very patient. You can check http://localhost:8088 to see the progress. The number of plays
# in the corpus should constantly increase when reloading the page.

chiasmus_version_commit_id = play_counts_df[play_counts_df["document_count"] == 504].iloc[0]["id"]

chiasmus_detection_dracor.add_corpus_from_repo(
    repository_name="gerdracor", 
    commit=chiasmus_version_commit_id)

% Figure is rendered in the HTML output here

```{figure} ./images/local_dracor_504_gerdracor_plays.png
---
width: 600px
name: local_dracor_504_gerdracor_plays
---
Frontend of the local DraCor infrastructure with GerDraCor (version: 6e1020dcfcb98a0d027ceb401a6a5fbd4537fe29) including 504 plays
```

To train the classifier, a manually annotated data set consisting of four plays by Friedrich Schiller is used (cf. {cite:p}`schneider_2021_chiasmi{p.98}`; citations can also be found in {ref}`section-3-1`). The paper names the titles of these plays. In the following list, DraCor identifiers are added:

* *Die Piccolomini* (`schiller-die-piccolomini`, [ger000086](https://dracor.org/id/ger000086)) 
* *Wallensteins Lager* (`schiller-wallensteins-lager`, [ger000025](https://dracor.org/id/ger000025)) 
* *Wallensteins Tod* (`schiller-wallensteins-tod`, [ger000058](https://dracor.org/id/ger000058))
* *Wilhelm Tell* (`schiller-wilhelm-tell`, [ger000452](https://dracor.org/id/ger000452))

% This cell is removed from the rendering, the citations are in another section
> “We perform two types of experiments. [...] In the second experiment we evaluate how well our model generalizes to texts from different authors not included in the training data. To this end we extract PoS tag inversions from the GerDraCor corpus (Fischer et al., 2019) [...]” The training data set (https://git.io/ChiasmusData) “[...] consists of four annotated texts by Friedrich Schiller Die Piccolomini, Wallensteins Lager, Wallensteins Tod and Wilhelm Tell. We annotated the whole texts, finding 45 general chiasmi and 9 antimetaboles.” {cite:p}`schneider_2021_chiasmi{p.98}` emphasis [bold] by us

And further:

> “[...] we evaluate the generalization performance of our chiasmus classifier trained on the four annotated Schiller dramas to other texts. The first set of texts comprises seven other dramas by Friedrich Schiller [...]. To see how well our method generalizes to different authors, we tested it on the remaining 493 documents from GerDraCor.” {cite:p}`schneider_2021_chiasmi{p.98}` emphasis [bold] by us

In [None]:
# [...] four annotated texts by Friedrich Schiller
# Die Piccolomini, 
# Wallensteins Lager, 
# Wallensteins Tod 
# and Wilhelm Tell.

# Add an empty new corpus "training" with the following metadata

chiasmus_annotated_corpus_metadata = {
    "name" : "training", 
    "title": "Schiller Training Corpus",
    "description": "Corpus of four plays by Friedrich Schiller used to train the Chiasmus Classifier"
}

chiasmus_detection_dracor.add_corpus(corpus_metadata=chiasmus_annotated_corpus_metadata)

# Create a list with the playnames/filenames of the plays to add

chiasmus_annotated_schiller_corpus_playnames = [
    "schiller-die-piccolomini",
    "schiller-wallensteins-lager",
    "schiller-wallensteins-tod",
    "schiller-wilhelm-tell"]

# Add each play in the respective version to the previously created corpus

for playname in chiasmus_annotated_schiller_corpus_playnames:
    chiasmus_detection_dracor.add_play_version_to_corpus(
        filename=playname,
        repository_name="gerdracor",
        commit=chiasmus_version_commit_id,
        corpusname="training")

The classifier is then tested on a data set consisting of the other seven plays by Friedrich Schiller included in the GerDraCor corpus at that time. The authors do not explicitly spell out the titles, but with the [local instance of GerDraCor](http://localhost:8088/ger) at hand one can filter for the plays by Schiller, or even use the API to select the plays by a certain author (identified by a Wikidata identifier, here: [Q22670](https://wikidata.org/entity/Q22670)), as is demonstrated in the following hidden code cell. We use this information to create an additional corpus `test` in the local DraCor instance.
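
Filtering by author can be sketched as a small helper that works on a list of play records. This is only an illustration under stated assumptions: the record shape (a `name` field plus an `authors` list whose entries carry `refs` with `type` and `ref`) is assumed here and should be checked against the response of the API version actually running; the function name `plays_by_author` and the non-Schiller sample record are hypothetical.

```python
def plays_by_author(plays, wikidata_id):
    """Return the names of plays that list `wikidata_id` among their author references."""
    selected = []
    for play in plays:
        for author in play.get("authors", []):
            refs = author.get("refs", [])
            if any(r.get("type") == "wikidata" and r.get("ref") == wikidata_id for r in refs):
                selected.append(play["name"])
                break  # count each play once, even if several authors match
    return selected


# Hand-written sample records in the ASSUMED shape (not fetched from any API)
sample_plays = [
    {"name": "schiller-maria-stuart",
     "authors": [{"name": "Schiller, Friedrich",
                  "refs": [{"type": "wikidata", "ref": "Q22670"}]}]},
    {"name": "placeholder-play-by-another-author",
     "authors": [{"name": "Placeholder, Author",
                  "refs": [{"type": "wikidata", "ref": "Q0000000"}]}]},
    {"name": "schiller-die-raeuber",
     "authors": [{"name": "Schiller, Friedrich",
                  "refs": [{"type": "wikidata", "ref": "Q22670"}]}]},
]

schiller_playnames = plays_by_author(sample_plays, "Q22670")
print(schiller_playnames)
```

Removing the four training plays from such a result would leave the candidates for the `test` corpus.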

In [None]:
# see example of Schnitzler Corpus in:
# https://github.com/dracor-org/dracor-notebooks/blob/docker/docker/local-dracor-with-docker.ipynb
# Needs to be implemented

In [None]:
# [...] seven other dramas by Friedrich Schiller [...]

# Add an empty new corpus "test" with the following metadata

chiasmus_test_corpus_metadata = {
    "name" : "test", 
    "title": "Schiller Test Corpus",
    "description": "Corpus of the seven other plays by Friedrich Schiller used to test the Chiasmus Classifier"
}

chiasmus_detection_dracor.add_corpus(corpus_metadata=chiasmus_test_corpus_metadata)

# Create a list with the playnames/filenames of the plays to add

chiasmus_schiller_testset_playnames = [
    "schiller-maria-stuart",
    "schiller-kabale-und-liebe",
    "schiller-don-carlos-infant-von-spanien",
    "schiller-die-verschwoerung-des-fiesco-zu-genua",
    "schiller-die-raeuber",
    "schiller-die-jungfrau-von-orleans",
    "schiller-die-braut-von-messina"]

# Add each play in the respective version to the previously created corpus

for playname in chiasmus_schiller_testset_playnames:
    chiasmus_detection_dracor.add_play_version_to_corpus(
        filename=playname,
        repository_name="gerdracor",
        commit=chiasmus_version_commit_id,
        corpusname="test")


The last corpus that is added to the local DraCor instance contains the remaining 493 documents from GerDraCor that are also used in the paper for testing purposes. In the hidden code cell below this corpus called `rest` is created:
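
The numbers reported in the paper can be checked against each other before the corpus is built: 504 plays minus the four training plays and the seven test plays should leave 493. The sketch below illustrates this with the playnames used in this section and placeholder names for all other plays; the helper `split_rest` is hypothetical and not part of the StableDraCor package.

```python
training = [
    "schiller-die-piccolomini",
    "schiller-wallensteins-lager",
    "schiller-wallensteins-tod",
    "schiller-wilhelm-tell",
]
test = [
    "schiller-maria-stuart",
    "schiller-kabale-und-liebe",
    "schiller-don-carlos-infant-von-spanien",
    "schiller-die-verschwoerung-des-fiesco-zu-genua",
    "schiller-die-raeuber",
    "schiller-die-jungfrau-von-orleans",
    "schiller-die-braut-von-messina",
]


def split_rest(all_playnames, excluded):
    """Return all playnames not in `excluded`; fail if an excluded name is absent."""
    missing = set(excluded) - set(all_playnames)
    if missing:
        raise ValueError(f"Excluded plays not found in corpus: {sorted(missing)}")
    excluded_set = set(excluded)
    return [name for name in all_playnames if name not in excluded_set]


# Placeholder names stand in for the 493 non-Schiller plays of the real corpus version
all_playnames = training + test + [f"placeholder-play-{i:03d}" for i in range(493)]

rest = split_rest(all_playnames, training + test)
print(len(all_playnames), len(rest))
```

Raising an error for an excluded name that is missing guards against silent mismatches, for example when a play was renamed between corpus versions.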

In [None]:
# "[...] remaining 493 documents from GerDraCor": we therefore need to exclude
# all the plays by Friedrich Schiller in the two lists above

chiasmus_playnames_exclude = chiasmus_annotated_schiller_corpus_playnames + chiasmus_schiller_testset_playnames

chiasmus_rest_corpus_metadata = {
    "name" : "rest", 
    "title": "Non-Schiller plays",
    "description": "Corpus of the remaining plays in GerDraCor used to test the Chiasmus Classifier"
}

# Add the GerDraCor version, but change the Corpus Metadata; exclude all Schiller-plays added before as separate corpora

chiasmus_detection_dracor.add_corpus_from_repo(
                             commit = chiasmus_version_commit_id,
                             repository_name = "gerdracor",
                             use_metadata_of_corpus_xml = False,
                             corpus_metadata = chiasmus_rest_corpus_metadata, 
                             exclude = chiasmus_playnames_exclude)

% Figure is rendered in the HTML output here

```{figure} ./images/local_chiasmus_detection_dracor.png
---
width: 600px
name: local_chiasmus_detection_dracor
---
Frontend of the local DraCor infrastructure including the corpora mentioned in the "Chiasmus Detection" paper
```

In [None]:
# Output the manifest of the local DraCor infrastructure
# The Schiller Corpora are not included; this functionality has not been implemented yet!
# Would need to actively include the "include" field of the manifest
# Creating the image doesn't make too much sense. The labels won't be right
# This cell is not included in the final HTML rendering of the report

chiasmus_detection_dracor.get_manifest()

(section-4-4-2)=
### 4.4.2 GerDraCor-based DLINA Sydney Corpus

In section {ref}`excursus-gerdracor-birth` the DLINA Sydney Corpus was mentioned as the main source of the later GerDraCor corpus. The original DLINA corpus contained 465 plays taken from the TextGrid Repository. As the corpus archeology (see {ref}`excursus_major_revisions`) has shown, there were several significant changes to the encoding, as a result of which the data added early on would not be compatible with the current versions of the DraCor API. Still, it might be an interesting use case to repeat earlier studies of the DLINA collective on the basis of a recent version of the data. Therefore, the following code samples demonstrate how a GerDraCor-based DLINA corpus can be built in a local DraCor setup. It will include all 465 plays originally included in the DLINA corpus, but in the versions included in the latest GerDraCor.

In [1]:
%%bash

# Stop and remove all Docker containers to avoid conflicts 
# especially regarding ports in the next section
# This cell does not show up in the final rendering of the report
# If you want to use the Docker containers above and play around with them, 
# the following commands should NOT be run

docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)

c2a6d50857a0
7791361a5f54
233a4177125c
08dc69c0dbed
c2a6d50857a0
7791361a5f54
233a4177125c
08dc69c0dbed


In [2]:
# Setup (This partly repeats code that has been executed elsewhere; make sure everything is available)
# This cell is not included in the final HTML rendering of the report

# The methods needed for the following analysis are bundled as the class "GitHubRepo",
# which we import with the following line

from github_utils import GitHubRepo

# Import the package
from stabledracor.client import StableDraCor

# Use a GitHub Personal Access Token, see 
# https://github.com/ingoboerner/stable-dracor/blob/e300d77c419537538b4d491a8bbe2b9449123131/notebooks/03_faq.ipynb

import os
github_token = os.environ.get("GITHUB_TOKEN")

# For analysing the GerDraCor commit history: Start the GitHub commits analysis with previously downloaded data

repo = GitHubRepo(repository_name="gerdracor", 
                  github_access_token=github_token,
                  import_commit_list="tmp/gerdracor_commits.json",
                  import_commit_details="tmp/gerdracor_commits_detailed.json",
                  import_data_folder_objects="tmp/gerdracor_data_folder_objects.json",
                  import_corpus_versions="tmp/gerdracor_corpus_versions.json")

In [3]:
dlina_gerdracor = StableDraCor(
    name="dlina_gerdracor", 
    description="DraCor system including all plays available in the DLINA Sydney corpus in a recent DraCor encoding",
    github_access_token=github_token
)

dlina_gerdracor.run()

 Container dlina_gerdracor-metrics-1  Creating
 Container dlina_gerdracor-fuseki-1  Creating
 Container dlina_gerdracor-metrics-1  Created
 Container dlina_gerdracor-fuseki-1  Created
 Container dlina_gerdracor-api-1  Creating
 Container dlina_gerdracor-api-1  Created
 Container dlina_gerdracor-frontend-1  Creating
 Container dlina_gerdracor-frontend-1  Created
 Container dlina_gerdracor-fuseki-1  Starting
 Container dlina_gerdracor-metrics-1  Starting
 Container dlina_gerdracor-fuseki-1  Started
 Container dlina_gerdracor-metrics-1  Started
 Container dlina_gerdracor-api-1  Starting
 Container dlina_gerdracor-api-1  Started
 Container dlina_gerdracor-frontend-1  Starting
 Container dlina_gerdracor-frontend-1  Started


True

With the commit "fdac66ba90c2c094012dc90395e952411d324e4c" the original DLINA file names of all TEI-XML files were changed to match the `playname` identifier used in GerDraCor. Even after that commit, however, some files were renamed again.
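
Resolving such renames amounts to following a chain of "previous name → new name" records until a name is reached that was not renamed again. The sketch below illustrates this; the field names `previous_filename` and `new_filename` and the `tei/` path prefix follow the records used in this notebook, while the function `resolve_current_name` and the sample records are hypothetical. It assumes that each file name is renamed at most once and ignores the order of commits.

```python
from pathlib import PurePosixPath


def resolve_current_name(playname, renames):
    """Follow rename records until reaching a file name that was not renamed again."""
    mapping = {r["previous_filename"]: r["new_filename"] for r in renames}
    path = f"tei/{playname}.xml"
    seen = {path}
    while path in mapping:
        path = mapping[path]
        if path in seen:
            # Guard against records that rename a file back to an earlier name
            raise ValueError(f"Cyclic rename records involving {path}")
        seen.add(path)
    # Strip the directory and the ".xml" extension to get the playname
    return PurePosixPath(path).stem


# Invented rename records: one play is renamed twice, another is never renamed
renames = [
    {"previous_filename": "tei/example-old-title.xml",
     "new_filename": "tei/example-interim-title.xml"},
    {"previous_filename": "tei/example-interim-title.xml",
     "new_filename": "tei/example-final-title.xml"},
]

resolved = resolve_current_name("example-old-title", renames)
unchanged = resolve_current_name("example-unchanged", renames)
print(resolved, unchanged)
```

A play that was split into several files rather than renamed is not covered by this approach and has to be handled separately.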

In [4]:
# Should get the list of the original playnames, 
# but make sure there are no problems with renamed plays

version_renaming_id = "fdac66ba90c2c094012dc90395e952411d324e4c"

# From the dictionary representing the "version" get the field "playnames" that contains the
# plays available in a certain version

dlina_playnames = repo.get_corpus_version(version=version_renaming_id)["playnames"]

In [None]:
# Get the file-names from the renaming version

In [5]:
rename_incidents = repo.get_renamed_files(
    exclude_versions=["e18c322706417825229f1471b15bd6daaeaf3ab1", 
                     "fdac66ba90c2c094012dc90395e952411d324e4c"])

In [6]:
new_gerdracor_playnames = []
unchanged_playnames = []

for original_playname in dlina_playnames:
    renamed_flag = False
    for item in rename_incidents:
        if f"tei/{original_playname}.xml" == item["previous_filename"]:
            new_gerdracor_playname = item["new_filename"].split("/")[1].replace(".xml","")
            new_gerdracor_playnames.append(new_gerdracor_playname)
            renamed_flag = True
    if renamed_flag is False:
        unchanged_playnames.append(original_playname)

In [None]:
len(new_gerdracor_playnames)

In [None]:
len(unchanged_playnames)

In [None]:
assert len(new_gerdracor_playnames) + len(unchanged_playnames) == 465, "Not all renamed plays detected."

In [7]:
# Get the latest commit to the GerDraCor repository. 
# This will be included in the description of the new corpus
# it is also needed because this version will be used as source of the plays added to the corpus

latest_gerdracor_version_id = repo.get_latest_corpus_version()["id"]

# Add an empty new corpus "dlina" with the following metadata

dlina_corpus_metadata = {
    "name" : "dlina", 
    "title": "GerDraCor-based DLINA Corpus Sydney",
    "description": f"Version of the German Drama Corpus (GerDraCor, version {latest_gerdracor_version_id}) containing only plays available in the DLINA Corpus Sydney"
}

dlina_gerdracor.add_corpus(corpus_metadata=dlina_corpus_metadata)

True

In [8]:
# Add each play in the respective version to the previously created corpus

playnames_to_add = new_gerdracor_playnames + unchanged_playnames

for playname in playnames_to_add:
    dlina_gerdracor.add_play_version_to_corpus(
        filename=playname,
        repository_name="gerdracor",
        commit=latest_gerdracor_version_id,
        corpusname="dlina")



In [None]:
# This is not a renaming problem with fouque-der-held-des-nordens;
# the play has been split up in three

# I do notice that I keep asking about DLINA here... so https://dlina.github.io/linas/212/ no longer exists in this form; this is not a case of "renaming"; did the original lina212 ->
# TEI version become ger000212, ger000471 and ger000470?

# pt: yep, that was split in DraCor

# ff: yes, we had already split the 5 multi-part plays in this list (from 2015) earlier: https://github.com/DLiNa/project/blob/master/data/TextGrid-Repository---List-of-all-dramatic-texts.txt

# (numbers 18, 137, 217, 273, 490 in the list; these were not yet DraCor IDs)

# With Fouqué's Held des Nordens I only noticed that it is a multi-part work when I went through the plays one by one, hence the later splitting…

In [None]:
# Sigurd, der Schlangentödter
# "fouque-sigurds-rache"
# Aslauga

In [None]:
# The DLINA play https://dlina.github.io/linas/212/ is actually a trilogy;
# in GerDraCor the play is split into three individual plays

fouque_held_des_nordens_part_playnames = [
    "fouque-sigurd-der-schlangentoedter",
    "fouque-sigurds-rache",
    "fouque-aslauga"
]

for playname in fouque_held_des_nordens_part_playnames:
    dlina_gerdracor.add_play_version_to_corpus(
        filename=playname,
        repository_name="gerdracor",
        commit=latest_gerdracor_version_id,
        corpusname="dlina")

% Figure is rendered in the HTML output here

```{figure} ./images/local_dlina_dracor.png
---
width: 600px
name: local_dlina_dracor
---
Frontend of the local DraCor infrastructure including the plays contained in the DLINA Corpus Sydney
```

In [None]:
# Get links to the commits that add the three individual plays

for playname in fouque_held_des_nordens_part_playnames:
    print(repo.get_github_commit_url_of_version(
        repo.get_corpus_version_adding_play(playname=playname)["id"]))

On December 16th, 2018, the play with the filename `fouque-der-held-des-nordens.xml`, which is actually a trilogy, was split up into three individual plays.

In [None]:
%%bash

# Stop and remove all Docker containers to avoid conflicts 
# especially regarding ports in the next section
# This cell does not show up in the final rendering of the report
# If you want to use the Docker containers above and play around with them, 
# the following commands should NOT be run

docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)