
Fri Nov 29 05:42:55 2024

Web Semantics: Computing and Visualizing the 19th-Century Li
Source: Bruce Sterling

Jockers, Matthew, Stanford University, USA, mjockers@stanford.edu
overview

In literary studies, we have no shortage of anecdotal wisdom regarding the role of influence on creativity. Consider just a few of the most prominent voices:

‘Talents imitate, geniuses steal’ – Oscar Wilde (1854-1900?).1
‘All ideas are second hand, consciously and unconsciously drawn from a million outside sources’ – Mark Twain (1903).
‘The historical sense compels a man to write not merely with his own generation in his bones, but with a feeling that the whole of the literature … has a simultaneous existence’ – T. S. Eliot (1920).
‘The elements of which the artwork is created are external to the author and independent of him …’ – Osip Brik (1929).
Anxiety of Influence – Harold Bloom (1973).
Whether consciously influenced by a predecessor or not, it might be argued that every book is in some sense a necessary descendant of, or necessarily ‘connected to,’ those before it. Influence may be direct, as when a writer models his or her writing on that of another writer,2 or it may be indirect, in the form of unconscious borrowing. Influence may even be ‘oppositional,’ as in the case of a writer who intentionally sets out to make his or her writing different from that of a predecessor. The aforementioned thinkers offer informed but anecdotal evidence in support of their claims of influence. My research brings a complementary quantitative and macroanalytic dimension to the discussion of influence. For this, I employ the tools and techniques of stylometry, corpus linguistics, machine learning, and network analysis to measure influence in a corpus of late 18th- and 19th-century novels.

method
The 3,592 books in my corpus were published between 1780 and 1900 and were written by authors from Britain, Ireland, and America; the corpus is almost evenly balanced in terms of gender representation. From each of these books, I extracted stylistic information using techniques similar to those employed in authorship attribution analysis: the relative frequencies of every word and mark of punctuation are calculated, and the resulting data are winnowed so as to exclude features that do not meet a preset relative frequency threshold.3 From each book I also extracted thematic (or ‘topical’) information using Latent Dirichlet Allocation (Blei, Ng et al. 2003; Blei, Griffiths et al. 2004; Chang, Boyd-Graber et al. 2009). The thematic data include the percentage of each theme/topic found in each text.4 I combine these two categories of data, stylistic and thematic, to create ‘book signals’ composed of 592 unique feature measurements. The Euclidean metric is then used to calculate every book’s distance from every other book in the corpus. The result is a 3,592 x 3,592 distance matrix.5
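The steps described above can be sketched on a toy corpus. This is a minimal illustration, not the author's implementation: the function names, the three invented books, and the topic proportions are all hypothetical, and the winnowing rule used here (keep a feature if its mean relative frequency across the corpus meets the threshold) is an assumption, since the exact rule is not stated in the text. Topic proportions are simply supplied as given numbers rather than estimated with Latent Dirichlet Allocation.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Return each token's share of all tokens (words and punctuation) in one book."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


def build_signals(books, topic_shares, threshold):
    """Combine winnowed stylistic features with topic proportions into one vector per book.

    Assumption: a feature is kept when its mean relative frequency across
    the corpus is at least `threshold`; the source does not give the exact rule.
    """
    freqs = {title: relative_frequencies(tokens) for title, tokens in books.items()}
    vocab = sorted({token for f in freqs.values() for token in f})
    kept = [t for t in vocab
            if sum(f.get(t, 0.0) for f in freqs.values()) / len(freqs) >= threshold]
    signals = {title: [f.get(t, 0.0) for t in kept] + list(topic_shares[title])
               for title, f in freqs.items()}
    return kept, signals


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(signals):
    """Return the sorted titles and the square matrix of pairwise distances."""
    titles = sorted(signals)
    matrix = [[euclidean(signals[a], signals[b]) for b in titles] for a in titles]
    return titles, matrix


# Hypothetical toy corpus: pre-tokenized text and made-up topic proportions.
books = {
    "A": "the cat sat , the cat ran .".split(),
    "B": "the dog sat , the dog ran .".split(),
    "C": "a storm ; a storm ; a storm !".split(),
}
topic_shares = {"A": [0.7, 0.3], "B": [0.6, 0.4], "C": [0.1, 0.9]}

kept, signals = build_signals(books, topic_shares, threshold=0.05)
titles, matrix = distance_matrix(signals)
```

In this toy case books A and B, which share most of their vocabulary and topic mix, end up closer to each other than either is to C. The matrix is symmetric with a zero diagonal, so a corpus of n books yields n x n cells.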

Although measuring and tracking ‘actual’ or ‘true’ influence, conscious or unconscious, is impossible, the stylistic-thematic distance/similarity measurements can serve as a proxy for influence.6 Network visualization software can then be used to organize, visualize, and study the presence of influence among the books in my corpus.7 To prepare the data for use in a network environment, I converted the distance matrix into a long-form table with 12,902,464 rows and three columns, in which each row captures a distance relationship between two books. The first cell contains a ‘source’ book, the second cell a ‘target’ book, and the third cell the measured distance between the two. After removing all of the records in which the target book was published before, or in the same year as, the source book,8 the data were reduced from 12,902,464 records to 6,447,640. These data and a separate table of metadata were then imported into the open source network analysis software package Gephi (2009) for analysis and visualization.
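The conversion from a square matrix to a filtered long-form table can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the three books, their publication years, and the distances are all invented, and only the filtering rule (drop any row whose target was published before, or in the same year as, its source) comes from the text. The row count before filtering is the square of the number of books, which for the corpus described above is 3,592 x 3,592 = 12,902,464.

```python
def to_edge_list(titles, matrix, years):
    """Flatten a square distance matrix into (source, target, distance) rows.

    A row is kept only when the target book was published strictly after
    the source book, so same-year pairs and self-pairs are dropped.
    """
    edges = []
    for i, source in enumerate(titles):
        for j, target in enumerate(titles):
            if years[target] > years[source]:
                edges.append((source, target, matrix[i][j]))
    return edges


# Hypothetical example: three books, two of them published in the same year.
titles = ["A", "B", "C"]
years = {"A": 1800, "B": 1850, "C": 1850}
matrix = [
    [0.0, 0.4, 0.9],
    [0.4, 0.0, 0.7],
    [0.9, 0.7, 0.0],
]

full_rows = len(titles) ** 2          # rows in the unfiltered long-form table
edges = to_edge_list(titles, matrix, years)
```

Here the unfiltered table has nine rows and the filtered one has two, both pointing from the 1800 book to an 1850 book. The filter makes every remaining edge run forward in time, which is what allows distance to be read as a proxy for influence of an earlier book on a later one.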

analysis
Networks are constructed out of nodes (books) and edges (distances). When plotted, less similar nodes (i.e. those with larger distances between them) spread farther apart in the network. Figure 1 offers a simplified example of three imaginary books….
