3. Versioning Living Corpora Using Git Commit
In the following, we show the capabilities of Git for the versioning and change tracking of living corpora. We do this by describing and analyzing the evolution of a GitHub repository that contains a DraCor corpus. While the PDF version of this report only documents the analysis, this web-based version is executable and fully reproducible.[1] Our analysis is conducted with the German Drama Corpus (GerDraCor), but the method used (and the code implemented) is largely applicable to any other DraCor corpus.
3.1 How to Better Not Cite a Living Corpus. An Example From Current Research
In this first step, we will take a short and exemplary look at an actual CLS research project and how it deals with the living corpora of DraCor. Our aim is to show that the way DraCor is cited is insufficient to enable reproducibility of the research.
It has become quite common for research that uses DraCor corpora to (1) cite the paper [Fischer et al., 2019][2] and (2) include the information on how many plays are in the corpus used.
Plays used as examples are mostly referenced by author and title (and not, as we would recommend, by their DraCor ID[3]). This can, for example, be observed in the following quotations from a research paper that uses GerDraCor to develop and test a tool using machine learning methods to detect chiasmi in literary texts:
“We perform two types of experiments. […] In the second experiment we evaluate how well our model generalizes to texts from different authors not included in the training data. To this end we extract PoS tag inversions from the GerDraCor corpus (Fischer et al., 2019) […]” The training data set (https://git.io/ChiasmusData) “[…] consists of four annotated texts by Friedrich Schiller Die Piccolomini, Wallensteins Lager, Wallensteins Tod and Wilhelm Tell. We annotated the whole texts, finding 45 general chiasmi and 9 antimetaboles.” [Schneider et al., 2021, p.98] emphasis [bold] by us
And further
“[…] we evaluate the generalization performance of our chiasmus classifier trained on the four annotated Schiller dramas to other texts. The first set of texts comprises seven other dramas by Friedrich Schiller […]. To see how well our method generalizes to different authors, we tested it on the remaining 493 documents from GerDraCor.” [Schneider et al., 2021, p.98] emphasis [bold] by us
Although the authors publish their tool and the derived dataset (https://git.io/ChiasmusData, cvjena/chiasmus-annotations) as open source and open data, respectively, it is not self-evident which version of GerDraCor was used. The only information that may help to identify the version is the number of plays (“504”[4]) included in GerDraCor at the time the training and test data sets were assembled from the corpus (and, of course, bibliographic metadata of the study itself, such as the date of publication). But in fact, there might be more than one version with 504 plays.
Based on this information, it is therefore not clear exactly what data was used in the study, which is a problem for any attempt to reproduce this research. But how could this problem be solved? In the next chapter, we will show that there is a quite simple and elegant solution: the Git commit history.
3.2 Citing Git Commits as Corpus Versions. Introduction
In our report “On Programmable Corpora” we have already introduced GitHub as a “key infrastructural component” in developing the DraCor toolset as well as in curating and hosting DraCor corpora. Previously, we have also relied on GitHub to link directly into the codebase of the DraCor API and other components of our ecosystem when explaining its inner workings (cf. Börner and Trilcke [2023]). In this section, however, it is the platform GitHub itself that is the focus of our attention, as we try to demonstrate how to deal effectively with datasets that are constantly in flux. Because DraCor uses Git (and GitHub, respectively) for publishing corpora, the process of creating and maintaining a corpus is fully transparent and traceable. As we will show, this also opens up unrivaled possibilities for versioning and the corresponding referencing of living corpora.
Unlike the repositories of DraCor software components (cf. the repository of the DraCor API), for which releases are published, this feature is (currently) not used in the case of corpus repositories.[5] However, it is still possible to point very precisely to a single “version” (or “snapshot”) of the data set. This can be done by referring to an individual commit[6]. Because all editing operations are “recorded” or “logged” when committed, the commits can be used to reconstruct the state of a corpus at a given point in time. We can consider the commits the “implicit versions” of DraCor corpora.
The GUI of GitHub already provides powerful tools to dive into the commit history of a corpus data set. The commit history of the repository dracor-org/gerdracor can be easily reached from the landing page (see Fig. 1).

Fig. 1 Landing page of the repository of GerDraCor on GitHub
The header above the file listing of the root folder includes a link to the latest commit as well as to the commit history (see Fig. 2).

Fig. 2 Links to most recent commit and commit history
The commit history allows for filtering commits by date range, e.g. it is possible to display the commits dating from February 2018. From this list a single commit can be explored, e.g. one from February 14th 2018. This commit is identified by the SHA value 30760ec3ff4aa340f785bcc17bfd3ca81e7e2d06, which can also be found as part of the URL in the address bar of the browser.
From the single commit view it is possible to get to all TEI-XML files of the plays in the corpus at that point in time. This can be done either by clicking on the button “Browse files” in the upper right corner of the gray commit page header and then, on the landing page, navigating to the folder tei; or, as a shortcut, by directly changing the URL in the address bar of the browser: to address the TEI files in the state of February 2018, the path /tree/{commit SHA}/tei can be appended to the URL of the GerDraCor repository, https://github.com/dracor-org/gerdracor, resulting in: https://github.com/dracor-org/gerdracor/tree/{commit SHA}/tei
This example demonstrates that, even without specialized tools and just by using the GitHub web interface, it is straightforward to precisely retrieve a dated “version” of the corpus files. The only requirement is that the commit or, at least, the precise date or date range in which the corpus was used is known.
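The URL manipulation described above can be sketched in a few lines of Python. This is only an illustration: the function name tei_folder_url and its parameters are our own and not part of any DraCor tooling, and the default folder name “tei” is an assumption that does not hold for every historical state of a corpus repository.

```python
GERDRACOR_REPO_URL = "https://github.com/dracor-org/gerdracor"


def tei_folder_url(commit_sha: str,
                   repo_url: str = GERDRACOR_REPO_URL,
                   folder: str = "tei") -> str:
    """Build the GitHub web URL of `folder` in the state of `commit_sha`.

    Hypothetical helper: it only concatenates strings and does not check
    whether the commit or the folder actually exists.
    """
    return f"{repo_url.rstrip('/')}/tree/{commit_sha}/{folder}"


# Commit from February 14th 2018 mentioned in the text
print(tei_folder_url("30760ec3ff4aa340f785bcc17bfd3ca81e7e2d06"))
```

Passing a different repo_url should work the same way for other corpus repositories, provided they keep their data in a folder of the given name.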
3.3 Retrieving (Technical) Corpus Metadata via the GitHub API
To retrieve metadata about the commits and, thus, the state of a corpus (the “implicit version”) at a given point in time, the GitHub API is used.[7] We will illustrate some functions of the API that are relevant in the following. Although we include URLs of concrete examples, the implemented methods to retrieve the data for the analysis in the excursus (see next chapter) will work the same way.
A list of commits of a repository, including basic metadata, can be requested from the URL https://api.github.com/repos/dracor-org/gerdracor/commits. This returns the commits in the repository in batches of 30, starting with the most recent one. We use this API operation to retrieve the identifiers of the commits (dictionary key “sha”) and the dates when the changes were committed. See the following code cell for an example.
import json

import requests

# first commit in the array returned in the response
# of a request to https://api.github.com/repos/dracor-org/gerdracor/commits
r = requests.get(url="https://api.github.com/repos/dracor-org/gerdracor/commits")
if r.status_code == 200:
    get_commits_example_response_data = json.loads(r.text)
get_commits_example_response_data[0]
{'sha': '5eb31c736cd103f5d3f0994b1e932141fc63a1d1',
'node_id': 'C_kwDOBH09MdoAKDVlYjMxYzczNmNkMTAzZjVkM2YwOTk0YjFlOTMyMTQxZmM2M2ExZDE',
'commit': {'author': {'name': 'Frank Fischer',
'email': 'lehkost@users.noreply.github.com',
'date': '2024-04-16T16:11:25Z'},
'committer': {'name': 'Frank Fischer',
'email': 'lehkost@users.noreply.github.com',
'date': '2024-04-16T16:11:25Z'},
'message': 'add play',
'tree': {'sha': '1a1538437b0fd40608b8327aaff4f89a4f74a141',
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/trees/1a1538437b0fd40608b8327aaff4f89a4f74a141'},
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/commits/5eb31c736cd103f5d3f0994b1e932141fc63a1d1',
'comment_count': 0,
'verification': {'verified': False,
'reason': 'unsigned',
'signature': None,
'payload': None}},
'url': 'https://api.github.com/repos/dracor-org/gerdracor/commits/5eb31c736cd103f5d3f0994b1e932141fc63a1d1',
'html_url': 'https://github.com/dracor-org/gerdracor/commit/5eb31c736cd103f5d3f0994b1e932141fc63a1d1',
'comments_url': 'https://api.github.com/repos/dracor-org/gerdracor/commits/5eb31c736cd103f5d3f0994b1e932141fc63a1d1/comments',
'author': {'login': 'lehkost',
'id': 6539515,
'node_id': 'MDQ6VXNlcjY1Mzk1MTU=',
'avatar_url': 'https://avatars.githubusercontent.com/u/6539515?v=4',
'gravatar_id': '',
'url': 'https://api.github.com/users/lehkost',
'html_url': 'https://github.com/lehkost',
'followers_url': 'https://api.github.com/users/lehkost/followers',
'following_url': 'https://api.github.com/users/lehkost/following{/other_user}',
'gists_url': 'https://api.github.com/users/lehkost/gists{/gist_id}',
'starred_url': 'https://api.github.com/users/lehkost/starred{/owner}{/repo}',
'subscriptions_url': 'https://api.github.com/users/lehkost/subscriptions',
'organizations_url': 'https://api.github.com/users/lehkost/orgs',
'repos_url': 'https://api.github.com/users/lehkost/repos',
'events_url': 'https://api.github.com/users/lehkost/events{/privacy}',
'received_events_url': 'https://api.github.com/users/lehkost/received_events',
'type': 'User',
'site_admin': False},
'committer': {'login': 'lehkost',
'id': 6539515,
'node_id': 'MDQ6VXNlcjY1Mzk1MTU=',
'avatar_url': 'https://avatars.githubusercontent.com/u/6539515?v=4',
'gravatar_id': '',
'url': 'https://api.github.com/users/lehkost',
'html_url': 'https://github.com/lehkost',
'followers_url': 'https://api.github.com/users/lehkost/followers',
'following_url': 'https://api.github.com/users/lehkost/following{/other_user}',
'gists_url': 'https://api.github.com/users/lehkost/gists{/gist_id}',
'starred_url': 'https://api.github.com/users/lehkost/starred{/owner}{/repo}',
'subscriptions_url': 'https://api.github.com/users/lehkost/subscriptions',
'organizations_url': 'https://api.github.com/users/lehkost/orgs',
'repos_url': 'https://api.github.com/users/lehkost/repos',
'events_url': 'https://api.github.com/users/lehkost/events{/privacy}',
'received_events_url': 'https://api.github.com/users/lehkost/received_events',
'type': 'User',
'site_admin': False},
'parents': [{'sha': 'a2d525c403208718b5d65ef29dc8dfb8ab710477',
'url': 'https://api.github.com/repos/dracor-org/gerdracor/commits/a2d525c403208718b5d65ef29dc8dfb8ab710477',
'html_url': 'https://github.com/dracor-org/gerdracor/commit/a2d525c403208718b5d65ef29dc8dfb8ab710477'}]}
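Because the commits are returned in batches, collecting the full history means requesting one page after another. The following is a minimal sketch of such a loop, not the implementation used for our analysis. It assumes the page, per_page, since and until query parameters of the GitHub commits endpoint; all function names are hypothetical. The page-fetching function is passed in as an argument so that the loop can be tried out offline with made-up data, as is done at the end of the block.

```python
from typing import Callable, Iterator, List, Optional, Tuple

COMMITS_URL = "https://api.github.com/repos/dracor-org/gerdracor/commits"


def fetch_page_from_github(page: int, per_page: int = 100,
                           since: Optional[str] = None,
                           until: Optional[str] = None) -> List[dict]:
    """Request one page of commits (needs network access; not called below)."""
    import requests  # imported lazily so the helpers below run without it

    params = {"page": page, "per_page": per_page}
    if since:
        params["since"] = since  # ISO 8601, e.g. "2018-02-01T00:00:00Z"
    if until:
        params["until"] = until
    r = requests.get(COMMITS_URL, params=params)
    r.raise_for_status()
    return r.json()


def iter_commits(fetch_page: Callable[[int], List[dict]],
                 max_pages: int = 1000) -> Iterator[dict]:
    """Yield commits page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        yield from batch


def sha_and_date(commit: dict) -> Tuple[str, str]:
    """Return the commit identifier and the date the change was committed."""
    return commit["sha"], commit["commit"]["committer"]["date"]


# Offline demonstration with made-up commits shaped like the API response
fake_pages = {
    1: [{"sha": "aaa111", "commit": {"committer": {"date": "2024-04-16T16:11:25Z"}}},
        {"sha": "bbb222", "commit": {"committer": {"date": "2024-04-15T09:00:00Z"}}}],
    2: [{"sha": "ccc333", "commit": {"committer": {"date": "2024-04-14T08:30:00Z"}}}],
}
pairs = [sha_and_date(c) for c in iter_commits(lambda page: fake_pages.get(page, []))]
print(pairs)
```

To run this against the real repository, iter_commits(fetch_page_from_github) could be used instead of the lambda. Unauthenticated requests to the GitHub API are rate limited, so a long history may require an access token or pauses between requests.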
More detailed information on a single commit can be retrieved by appending the SHA value to the URL of the commits endpoint. The detailed metadata of the commit identified by the SHA 67fa8b39c90d4a1952d11c771b5d58175a8ccdf4 can thus be retrieved by sending a request to the URL https://api.github.com/repos/dracor-org/gerdracor/commits/67fa8b39c90d4a1952d11c771b5d58175a8ccdf4.
On the basis of the returned data it is possible, for example, to find out which files were added, modified, renamed or deleted in a given commit (see “status” in the “files” section of the response object). In the case of the commit in question, the “files” section of the returned JSON object lists the TEI-XML file kotzebue-das-posthaus-in-treuenbrietzen.xml of the play “Das Posthaus in Treuenbrietzen” by August von Kotzebue with “added” as its status field value. See the following code cell for an example.
# "files" section of the object returned when requesting commit details
r = requests.get(url="https://api.github.com/repos/dracor-org/gerdracor/commits/67fa8b39c90d4a1952d11c771b5d58175a8ccdf4")
if r.status_code == 200:
    get_commit_details_example = json.loads(r.text)
get_commit_details_example["files"]
[{'sha': '0f0008dfcb846ae837b0b5de55753ced5059f2cb',
'filename': 'tei/kotzebue-das-posthaus-in-treuenbrietzen.xml',
'status': 'added',
'additions': 2592,
'deletions': 0,
'changes': 2592,
'blob_url': 'https://github.com/dracor-org/gerdracor/blob/67fa8b39c90d4a1952d11c771b5d58175a8ccdf4/tei%2Fkotzebue-das-posthaus-in-treuenbrietzen.xml',
'raw_url': 'https://github.com/dracor-org/gerdracor/raw/67fa8b39c90d4a1952d11c771b5d58175a8ccdf4/tei%2Fkotzebue-das-posthaus-in-treuenbrietzen.xml',
'contents_url': 'https://api.github.com/repos/dracor-org/gerdracor/contents/tei%2Fkotzebue-das-posthaus-in-treuenbrietzen.xml?ref=67fa8b39c90d4a1952d11c771b5d58175a8ccdf4'}]
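To see at a glance what a commit did to the corpus, the entries of the “files” section can be grouped by their status. The sketch below assumes a response object shaped like the one shown above. The helper name files_by_status is hypothetical, and the example input is reduced to the two fields the function actually reads.

```python
from collections import defaultdict
from typing import Dict, List


def files_by_status(commit_details: dict) -> Dict[str, List[str]]:
    """Group the filenames touched by a commit by their "status" value."""
    grouped = defaultdict(list)
    for file_entry in commit_details.get("files", []):
        grouped[file_entry["status"]].append(file_entry["filename"])
    return dict(grouped)


# Reduced version of the commit details shown above
example_commit_details = {
    "files": [
        {"filename": "tei/kotzebue-das-posthaus-in-treuenbrietzen.xml",
         "status": "added"},
    ]
}
print(files_by_status(example_commit_details))
```

Applied to every commit in the history, such a grouping would indicate when plays entered or left the corpus, although renamed files would probably need separate handling.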
Another bit of information that is helpful when trying to reconstruct the state of a corpus, especially the files included, at a given point in time is the “tree” of the commit. The URL for requesting this information is included in the “tree” section of both the basic commit metadata and the more detailed response. See the next cell for an example.
# tree object in the first commit returned by the /commits endpoint
get_commits_example_response_data[0]["commit"]["tree"]
{'sha': '1a1538437b0fd40608b8327aaff4f89a4f74a141',
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/trees/1a1538437b0fd40608b8327aaff4f89a4f74a141'}
So the tree of the above mentioned commit can be retrieved at https://api.github.com/repos/dracor-org/gerdracor/git/trees/3cbc81976a06a565d3ca673e3c17527bf6e30f8b.
To access the metadata of the individual files containing the play data, the data folder has to be identified. As usual for DraCor, in the case inspected here it is the tei folder. See the following code cell for an example.
# Get the tree object starting from the commit;
# this is basically a file listing of the root folder.
# We need to look for the dictionary with the value "tree" in the field with the key "type":
# "blob" entries are files, "tree" entries are folders.
# Normally, the data is in a folder with the name "tei" (but this has not always been the case).
example_tree_url = get_commits_example_response_data[0]["commit"]["tree"]["url"]
r = requests.get(url=example_tree_url)
if r.status_code == 200:
    example_tree_data = json.loads(r.text)
example_tree_data
{'sha': '1a1538437b0fd40608b8327aaff4f89a4f74a141',
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/trees/1a1538437b0fd40608b8327aaff4f89a4f74a141',
'tree': [{'path': 'README.md',
'mode': '100644',
'type': 'blob',
'sha': 'a7e7752d5803d3cd3fac6e70291fe0f9cd6e4829',
'size': 5022,
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/blobs/a7e7752d5803d3cd3fac6e70291fe0f9cd6e4829'},
{'path': 'corpus.xml',
'mode': '100644',
'type': 'blob',
'sha': '5480e4e61c741e086fe0886a738065829baf357d',
'size': 1092,
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/blobs/5480e4e61c741e086fe0886a738065829baf357d'},
{'path': 'css',
'mode': '040000',
'type': 'tree',
'sha': '98cce395ee68987dfbb57cc0276e832452724591',
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/trees/98cce395ee68987dfbb57cc0276e832452724591'},
{'path': 'format.conf',
'mode': '100644',
'type': 'blob',
'sha': '41e31477abaa75d8ca102229c2e92e9999caada4',
'size': 183,
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/blobs/41e31477abaa75d8ca102229c2e92e9999caada4'},
{'path': 'numOfSpeakers.png',
'mode': '100644',
'type': 'blob',
'sha': 'c1b4dfe969e6324fd59f707fe7c7ac4a185a2ce1',
'size': 9858,
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/blobs/c1b4dfe969e6324fd59f707fe7c7ac4a185a2ce1'},
{'path': 'playsPerDecade.png',
'mode': '100644',
'type': 'blob',
'sha': 'd610f042bf593b2bf123e3900d60a8b5bacef845',
'size': 6524,
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/blobs/d610f042bf593b2bf123e3900d60a8b5bacef845'},
{'path': 'tei',
'mode': '040000',
'type': 'tree',
'sha': '4e510dd9efb91f26f5d62f2d6a686c87a28f314d',
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/trees/4e510dd9efb91f26f5d62f2d6a686c87a28f314d'}],
'truncated': False}
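Identifying the data folder in such a root listing amounts to looking for an entry of type “tree” with the expected path. The following sketch illustrates this with an abbreviated copy of the listing above. The function name find_folder is hypothetical, and the default folder name “tei” is an assumption that, as noted, does not hold for every state of a repository.

```python
from typing import Optional


def find_folder(tree_data: dict, folder_name: str = "tei") -> Optional[dict]:
    """Return the entry of the folder `folder_name` in a tree listing, if any."""
    for entry in tree_data.get("tree", []):
        # "tree" entries are folders, "blob" entries are files
        if entry["type"] == "tree" and entry["path"] == folder_name:
            return entry
    return None


# Abbreviated copy of the root listing shown above
example_root_listing = {
    "tree": [
        {"path": "README.md", "type": "blob",
         "sha": "a7e7752d5803d3cd3fac6e70291fe0f9cd6e4829"},
        {"path": "css", "type": "tree",
         "sha": "98cce395ee68987dfbb57cc0276e832452724591"},
        {"path": "tei", "type": "tree",
         "sha": "4e510dd9efb91f26f5d62f2d6a686c87a28f314d"},
    ],
    "truncated": False,
}
tei_entry = find_folder(example_root_listing)
print(tei_entry["sha"] if tei_entry else "no data folder found")
```

Returning None rather than raising an error leaves it to the caller to decide how to treat commits in which the data folder has a different name.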
So by requesting the data from https://api.github.com/repos/dracor-org/gerdracor/git/trees/64a98327331abbaa110fe9c9db11208aad3ced90 we receive information about the individual file objects, most notably the filename in the field with the key “path” and the file size (“size”) in bytes. See the following cell for an example.
# Example of the metadata of a single TEI-XML file of a play in the "tei" folder
tei_folder_example_url = "https://api.github.com/repos/dracor-org/gerdracor/git/trees/64a98327331abbaa110fe9c9db11208aad3ced90"
r = requests.get(url=tei_folder_example_url)
if r.status_code == 200:
    tei_folder_contents_example = json.loads(r.text)
tei_folder_contents_example["tree"][0]
{'path': 'achat-ein-april-scherz.xml',
'mode': '100644',
'type': 'blob',
'sha': '87cd9f61a04cb322c0815afec694a49e4fd910b1',
'size': 102669,
'url': 'https://api.github.com/repos/dracor-org/gerdracor/git/blobs/87cd9f61a04cb322c0815afec694a49e4fd910b1'}
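From a folder listing of this kind, simple technical metadata about an “implicit version” can be derived, for instance the number of XML files and their combined size. The sketch below is one possible way to do this and makes several assumptions: the helper name summarize_tei_folder is hypothetical, every file ending in .xml is counted as a play, and the second entry in the sample data is invented for illustration (only the first one is taken from the output above).

```python
def summarize_tei_folder(folder_data: dict) -> dict:
    """Count the XML files in a tree listing and sum up their sizes in bytes."""
    xml_files = [
        entry for entry in folder_data.get("tree", [])
        if entry["type"] == "blob" and entry["path"].endswith(".xml")
    ]
    return {
        "xml_file_count": len(xml_files),
        "total_size_bytes": sum(entry.get("size", 0) for entry in xml_files),
        # If True, the API did not return the complete listing
        "truncated": folder_data.get("truncated", False),
    }


sample_folder_data = {
    "tree": [
        {"path": "achat-ein-april-scherz.xml", "type": "blob", "size": 102669},
        # Invented entry, for illustration only
        {"path": "hypothetical-play.xml", "type": "blob", "size": 1000},
    ],
    "truncated": False,
}
print(summarize_tei_folder(sample_folder_data))
```

The “truncated” flag is passed through because a count based on an incomplete listing would silently understate the size of the corpus.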