New DH tool: Tesserae for text mining
It’s been a while since I last posted, so I thought I’d share with you a DH project that I was recently asked to collaborate on. It’s called Tesserae, and can be found here: http://tesserae.caset.buffalo.edu/
Tesserae “aims to provide a flexible and robust web interface for exploring intertextual parallels.” So far it’s only available for Latin and Greek texts, but the project is expanding to include English poetry and prose. Essentially, it mines two texts in order to find common words and phrases. One limitation of the project is that at the moment it can only find verbatim occurrences of words and phrases, so case changes, for instance, throw it off. This limits the ability of Tesserae to find all forms of allusion, but its programmers are trying to find ways to expand its abilities.
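To make the idea of verbatim matching concrete, here is a minimal sketch in Python of how two texts could be compared for shared words and phrases. This is my own illustration and not Tesserae’s actual code: the function names (`tokenize`, `ngrams`, `shared_phrases`) are hypothetical, the first sample string is the opening line of the Aeneid typed in lowercase, and the second is an invented line used only for the demonstration.

```python
import re


def tokenize(text):
    """Split a text into word tokens, dropping punctuation and digits.

    No normalization is applied, so matching stays strictly verbatim
    (an assumption made for this sketch).
    """
    return re.findall(r"[^\W\d_]+", text)


def ngrams(tokens, n):
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_phrases(text_a, text_b, n=2):
    """Return the n-word phrases that occur verbatim in both texts."""
    return ngrams(tokenize(text_a), n) & ngrams(tokenize(text_b), n)


# Opening line of the Aeneid (lowercased) and an invented second line.
source = "arma virumque cano, troiae qui primus ab oris"
target = "arma virumque canam"

print(shared_phrases(source, target, n=2))  # two-word phrases in common
print(shared_phrases(source, target, n=1))  # single words in common
```

Here the phrase “arma virumque” is found in both lines, but “cano” and “canam” are treated as unrelated words even though they are forms of the same verb. That is the kind of miss that exact matching produces whenever a word’s form changes.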
Though it’s a little far afield of the sort of work I do, I think this tool raises a couple of interesting possibilities, as well as some problems, for the future of literary studies in the digital age. First is the issue of copyright. Tesserae’s creators are currently in the process of seeking legal counsel to explore the possibility of including 20th century texts in their database. Since the interface does not return a full text version of the original work, it’s possible that it would constitute fair use of a text. If this is the case, Tesserae may turn out to be a really fascinating and comprehensive tool for data mining. It would be really interesting, for instance, to look at patterns of allusions over time, at geographical influences, and so forth. It would also reduce a lot of the legwork it takes for one scholar to parse out all of the allusions in a single text, since even an expert human reader is bound to miss at least one or two. Expanding Tesserae’s database to include works published in the latter half of the 20th century would be of particular interest to me, and I imagine it would be of special interest to those who work with postmodern fiction. As of right now, there aren’t many (any?) text mining tools that can be used with copyrighted texts, which I think often makes scholars whose work lies in more contemporary periods feel isolated from the DH community.
I do have my qualms about the use of data mining in literary studies, but I think this is a case where this technology could prove especially useful to literary scholars. However, Tesserae still has a long way to go, and it may never be able to find every single allusion with complete accuracy, so I certainly don’t see it supplanting human readers anytime soon. Moreover, Tesserae’s legal woes highlight the issue of accessibility in DH: you need the resources of a large department in order to get the funding to seek out the right people to help you get a project like this off the ground.
A few of my colleagues and I have been asked to make suggestions as to which texts (early modern period to 19th century, to avoid copyright issues for now) would be most useful to add as preliminary texts. If you have any suggestions, I’d love to hear them!