Mining (Con)texts

10/20/2008 - 11:54pm
Scholar

John Unsworth gave a talk at Harvard tonight teasingly titled "How Not to Read a Million Books: Text Mining, and Reading the Unreadable." He spoke mostly about the MONK project, a Mellon-funded collaboration that's familiar, I'm sure, to many HASTACers. MONK applies text mining techniques and visualizations to discover new dimensions to literary and historical texts.

Unsworth described the work of several scholars already using the MONK toolkit. For instance, Tanya Clement, a PhD candidate in English Literature and Digital Studies, has successfully applied MONK to her research on Gertrude Stein's The Making of Americans, as she described in a recent article for Literary and Linguistic Computing:

"The particular reading difficulties engendered by the complicated patterns of repetition in The Making of Americansmirror those a reader might face attempting to read a large collectionof like texts at once without getting lost?likewise, it is almostimpossible to read this text in a traditional, linear manner. However, byvisualizing certain patterns and looking at the text ?from a distance?through textual analytics and visualizations, we are enabled to makereadings that were formerly inhibited. Franco Moretti has arguedthat the solution to truly incorporating a more global perspective inour critical literary practices is not to read more of the vast amountsof literature available to us, but to read it differently by employing'distant reading'. 'We know how to read texts', he writes, ?now let'slearn how not to read them' (Moretti, 2000Go, p. 57). Similarly, bylearning to read texts that have been misread 'at a distance', we arereading differently and we value different readings."

Sara Steger, a PhD candidate in English at the University of Georgia, is similarly using MONK in her study of sentimentalism in nineteenth-century novels. Not only could she train the program to recognize sentimental scenes, she was then able to mine a collection of texts for over-represented words in, for instance, Victorian deathbed scenes:

And, then, under-represented words in those same scenes:

Her results invite new research into the absence of formal expressions of mourning ("holy," "country," "lord"), and the presence of physical and emotional closeness ("pillow," "cheek," "breath") in deathbed scenes.
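For readers curious what "over-represented" and "under-represented" might mean computationally, here is a minimal sketch. It is not MONK's actual method, which the talk did not spell out; it assumes one common approach, a smoothed log ratio of word rates between a target set of passages and a reference set. The function names (tokenize, log_ratios), the smoothing value, and the toy sentences are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes.

    This is a deliberately crude tokenizer: it throws away
    capitalization, punctuation, and all typography.
    """
    return re.findall(r"[a-z']+", text.lower())


def log_ratios(target_texts, reference_texts, smoothing=0.5):
    """Score each word by log2(rate in target / rate in reference).

    Positive scores mean the word is over-represented in the target
    passages; negative scores mean it is under-represented.
    Additive smoothing (an assumption of this sketch) keeps words
    that appear in only one set from producing a division by zero.
    """
    target = Counter(w for doc in target_texts for w in tokenize(doc))
    reference = Counter(w for doc in reference_texts for w in tokenize(doc))
    vocab = set(target) | set(reference)

    target_total = sum(target.values()) + smoothing * len(vocab)
    reference_total = sum(reference.values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        target_rate = (target[word] + smoothing) / target_total
        reference_rate = (reference[word] + smoothing) / reference_total
        scores[word] = math.log2(target_rate / reference_rate)
    return scores


# Invented toy passages, not drawn from any real novel or from MONK.
deathbed_scenes = [
    "she laid her cheek on the pillow",
    "his breath was faint on the pillow",
]
other_scenes = [
    "the lord of the country spoke",
    "a holy day in the country",
]

scores = log_ratios(deathbed_scenes, other_scenes)
for word in sorted(scores, key=scores.get, reverse=True):
    print(f"{word:10s} {scores[word]:+.2f}")
```

Words near the top of the printed list lean toward the target passages and words near the bottom lean away from them. Note that even this tiny sketch makes interpretive choices (lowercasing, dropping punctuation, picking a smoothing value), which is very much the concern of the paragraphs that follow.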

I want to underscore that I think these tools do offer incredible, never-before-possible ways of looking at texts. But: I wonder about how slippery the word "text" becomes in the phrase "text mining." MONK and similar projects focus narrowly on "text" as a string of letters that can be plucked from any material context, plopped into another and manipulated, "mined," for meaning. Let's assume for a second that the OCR software always works perfectly (it doesn't), and that the scans of our target book have picked up all the paratexts, including the copyright page, advertisements, promotional blurbs, even page numbers. Then take that lump of letters and drop it into a text file. What are you left with? What so-called "accidentals," what context, has been lost in translation?

I'm reminded of Kenneth Goldsmith's book Day, in which he re-typed one day's New York Times word for word, from the upper left hand corner to the lower right hand corner, including page numbers and any text in advertisements. The resulting book -- a thick tome that essentially levels the dynamic space of the newspaper -- might have been the newspaper . . . but definitely was not the newspaper.

These questions become relevant particularly for Victorian novels, many of them stuffed with advertising and illustrations, or published serially in magazines alongside political cartoons or recipes. The Wordles created from deathbed scenes are fascinating and very exciting to me; but unless they're paired with some old-school bibliographic analysis, I worry that more has been elided from the text than it's worth. I also wonder (given my own interests) how text mining would work for early modern books, many of which may ascribe meaning and significance to "accidentals" like italics, capitalization and typographic variation. Unsworth acknowledged that text mining should only be one tool in the researcher's toolkit. What, then, would a combination of MONK-like text mining and bibliography look like? How can we apply "distant reading" to texts-as-strings-of-letters, while simultaneously doing a "close reading" of texts-as-material-objects?

Cathy Davidson

Critical Contexts

Hi, Whitney, I think you are right that, as with any other tool, one needs to supply context, critical thinking and critical skills to any data one mines. That's why HASTAC was created around three conjoined areas: creative design and innovation of tools for research and teaching and making art; critical thinking about those tools and about technology in society in general; participatory learning (using whatever tools are available to us now to be able to think through complex interdisciplinary problems together). If one keeps all three of those things in mind, then data mining is great because it offers a macro-survey of something that, of course, then is susceptible to serious, sustained, critical thinking, debate, interpretation, and theorizing. I find something similar with other forms of data gathering, such as genomics. I just read a terrific essay by novelist Richard Powers in GQ about being one of only 9 people to have his entire genome sequenced. What was interesting is that after the whole gruelling process he was left with scads of data, all of which demands far more (not less) interpretation.

 

That's the thing about data: it OPENS, rather than CLOSES, the scope of what we, as interpretive humanists, need to do. The more data, the more we need to use our training to understand its complexity, its implications, and its nuances, whether in text mining or genome expression.

 

travis

I think part of the problem is that the tools and corpora we currently have in the humanities provide a very shallow view of the data. For example, a corpus of unstructured text

whitneyt

thanks, + filters

Thanks for the great comments; I think you're both spot-on. Travis, you've got me thinking about the idea of filters again. I think this might be a more direct way to state the concerns I was thinking about in the post: I often hear text-mining discussed as if we're (to bring this verb back!) cleaning the text, transforming it from this tangible, messy thing stored on bound leaves of paper into clean, manipulable data; but in reality, we've just replaced one filter with another. Sometimes changing the filter is great, and yields interesting new information -- but it's still a filter, a manmade system built out of certain assumptions about our language, etcetcetc. I've been thinking a lot this last month about how it often seems that we don't question how new tools mediate our relationships with texts with as much vigor as we question, for instance, how textbook anthologies shape our relationship with a visual poet like William Blake. In fact, new tools are still often posited as the "solution," since they offer the opportunity to publish facsimiles of Blake's work. Well, okay, that solves one problem (how to show the poems in their illustrated context); but it introduces new ones.

arthurkukri

It's really interesting what people will do with the ordinary in order to transform it into the extraordinary. As for that book by Goldsmith, I always wondered if he had to get permission from the New York Times to do that or if it was covered by some type of artistic license.