A vibrant discussion followed my March 15th post, “A Proposal for a Corpus Sharing Protocol.” Carrie Schroeder, Allen Riddel and others on Twitter pointed out that, especially in non-English DH fields, many corpora are already on GitHub. These include texts from the Chinese Buddhist Electronic Text Association, the Open Greek and Latin Project at Leipzig, and papyri from the Integrating Digital Papyrology Project. The Text Creation Partnership released some 25,000 of its texts in January of this year and uploaded them to GitHub.
One of the more interesting Git corpus projects I became aware of following this discussion is GITenberg. Led by Seth Woodworth, the project scrapes a text from Project Gutenberg, initializes a git repository for it, adds README and CONTRIBUTING files generated from the text’s metadata, and uploads the resulting repository to GitHub. They have gitified around 43,000 works this way. The project also converts Project Gutenberg’s vanilla plain text into ASCIIDOC; a good example is the GITenberg edition of The Adventures of Huckleberry Finn. This is an amazingly ambitious project that holds the promise of wide-ranging applications for editing, versioning, and disseminating literature.
One such application might lie with the 68,000 digital texts recently created by the British Library. James Baker, a digital curator at the British Library, left a comment on my original post suggesting that the method I describe might be used to parse and post the Library’s texts. He sent me a few samples of the ALTO XML documents that the Stanford Literary Lab had used. I adapted some of the GITenberg code to read these texts, generate README files for them, and turn them into GitHub repositories. I’m provisionally calling this project Git-Lit.
Git-Lit aims to parse, version control, and post each work in the British Library’s corpus of digital texts. Parsing the texts will transform the machine-readable metadata into human-readable prefatory material; version controlling the texts will allow for collaborative editing and revision of the texts, effectively crowdsourcing the correction of OCR errors; and posting the texts to GitHub will ensure the texts’ visibility to the greater community.
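To make the parsing step concrete, here is a minimal sketch of turning machine-readable metadata into README text. It is not the Git-Lit code itself: the sample record is hypothetical and heavily simplified, the function names are my own, and it assumes the bibliographic fields are expressed in MODS (the Library of Congress schema commonly embedded in METS files), which may not match the British Library files exactly.

```python
import xml.etree.ElementTree as ET

# MODS namespace as published by the Library of Congress.
NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical, simplified record; real METS files are much larger.
SAMPLE = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>An Example Title</mods:title></mods:titleInfo>
  <mods:name><mods:namePart>Doe, Jane</mods:namePart></mods:name>
  <mods:originInfo><mods:dateIssued>1850</mods:dateIssued></mods:originInfo>
</mods:mods>"""


def first_text(root, path, default="Unknown"):
    """Return the stripped text of the first match for path, or default."""
    node = root.find(path, NS)
    if node is None or not (node.text or "").strip():
        return default
    return node.text.strip()


def make_readme(xml_string):
    """Build human-readable README text from a MODS metadata string."""
    root = ET.fromstring(xml_string)
    title = first_text(root, ".//mods:titleInfo/mods:title")
    author = first_text(root, ".//mods:name/mods:namePart")
    date = first_text(root, ".//mods:originInfo/mods:dateIssued")
    underline = "=" * len(title)
    return f"{title}\n{underline}\n\nAuthor: {author}\nDate: {date}\n"


print(make_readme(SAMPLE))
```

Missing fields fall back to "Unknown" rather than raising an error, since metadata in a large corpus is rarely complete.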
Git-Lit addresses these issues:
git clone followed by the repository URL. Parent repositories can then be assembled for collections of texts using git submodules. That is to say, a parent corpus repository might be created for nineteenth-century Bildungsromane, for instance, and that repository would contain pointers to individual texts that are themselves repositories.
A British Library text contains ALTO XML textual data as well as a Library of Congress METS XML metadata file. Git-Lit does the following:
I ran Git-Lit on the four sample texts in the data directory and generated the four GitHub repositories that can be found under the Git-Lit organization. You can read, fork, modify, or comment on the IPython Notebook that does this on the project repository at GitHub.
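For readers unfamiliar with ALTO, the sketch below shows one way the text layer of such a file could be pulled out as plain text. It is an illustration rather than the notebook's actual code. ALTO stores each word in the CONTENT attribute of a String element inside a TextLine; the sample page, its wording, and the function names are made up for this example, and the code ignores the namespace so that it does not depend on a particular ALTO version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal ALTO page; real files also carry layout coordinates.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="A"/><SP/><String CONTENT="sample"/><SP/><String CONTENT="line."/>
    </TextLine>
    <TextLine>
      <String CONTENT="Another"/><SP/><String CONTENT="line."/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Join the words of each TextLine with spaces, one output line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


print(alto_to_text(SAMPLE_ALTO))
```

A fuller version would probably also handle hyphenated line endings and page breaks, which this sketch leaves out.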
As this project develops, we’ll create indices for the texts in the form of submodule pointers. Category-based parent repositories might include “17th Century Novels,” “18th Century Correspondence,” or simply “Poetry,” and the categories need not be mutually exclusive. This will allow a literary scholar interested in a particular category to assemble a corpus quickly by cloning the parent repository with git clone and then checking out its submodules with git submodule update --init --recursive.
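The submodule workflow can be sketched in a few lines of Python. The helper names and the example URL are hypothetical, and the sketch only builds and prints the git commands; pass the lists to run_all to execute them, which assumes git is installed and the repositories exist.

```python
import subprocess


def add_submodule_commands(parent_dir, text_urls):
    """Commands a curator might run to register each text repo as a submodule."""
    return [["git", "-C", parent_dir, "submodule", "add", url] for url in text_urls]


def clone_corpus_commands(parent_url, dest):
    """Commands a reader might run to fetch a parent repo and all its texts."""
    return [
        ["git", "clone", parent_url, dest],
        ["git", "-C", dest, "submodule", "update", "--init", "--recursive"],
    ]


def run_all(commands):
    """Run each command in order, raising an error at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical URL; printed rather than run so nothing touches the network.
for command in clone_corpus_commands("https://github.com/example/poetry.git", "poetry"):
    print(" ".join(command))
```

Because a text can be added as a submodule of any number of parent repositories, overlapping categories cost nothing: each parent stores only a pointer to the text's repository.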
Later, we’ll write scripts to transform the texts into more useful formats, like ASCIIDOC and TEI XML. This will produce archival-quality versions of the texts and allow for rich scholarly markup.
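As a rough idea of what such a transformation script might look like, the sketch below wraps plain-text paragraphs in a minimal TEI document. The function name and the placeholder header text are assumptions of mine; a real conversion would carry over the full metadata and use much richer markup.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tag(name):
    """Qualify an element name with the TEI namespace."""
    return f"{{{TEI_NS}}}{name}"


def text_to_tei(title, plain_text):
    """Wrap blank-line-separated paragraphs in a minimal TEI skeleton."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    tei = ET.Element(tag("TEI"))
    file_desc = ET.SubElement(ET.SubElement(tei, tag("teiHeader")), tag("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tag("titleStmt"))
    ET.SubElement(title_stmt, tag("title")).text = title
    # Placeholder statements; TEI requires these header sections to be present.
    publication = ET.SubElement(file_desc, tag("publicationStmt"))
    ET.SubElement(publication, tag("p")).text = "Unpublished sketch."
    source = ET.SubElement(file_desc, tag("sourceDesc"))
    ET.SubElement(source, tag("p")).text = "Converted from plain text."
    body = ET.SubElement(ET.SubElement(tei, tag("text")), tag("body"))
    for block in plain_text.split("\n\n"):
        paragraph = " ".join(block.split())  # collapse hard line wraps
        if paragraph:
            ET.SubElement(body, tag("p")).text = paragraph
    return ET.tostring(tei, encoding="unicode")


print(text_to_tei("An Example Title", "First paragraph,\nhard-wrapped.\n\nSecond paragraph."))
```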
Please join this initiative! To contribute, contact me, or find an issue you can tackle on the project issue tracker. Also, feel free to add your own features, restructure the code, or make any other improvements. Pull requests are very welcome!