English Diachronic Corpus Linguistics: Prague Themes

Ondřej Tichý & Jan Čermák

Faculty of Arts, Charles University

Recent years have seen the growing popularity of quantitative historical linguistics in Prague, especially in the application of Digital Humanities methodologies. We will introduce the Prague take on diachronic corpus linguistics by focusing on some of the current issues that the Department of English Language has been pursuing.

In one of our early attempts at applying corpus methodology to a Praguian approach, we explored Skalička’s and Sgall’s theory of morphological typology and focused on quantifying typological change in the history of English. This study introduced information entropy as a measure of the predictability of a system and applied it to the exponents of inflectional categories throughout the history of English. It confirmed the major direction of typological change in English, but it also pointed out some intriguing trends in more recent history. Since entropy proved well suited to capturing the regularization and simplification of morphological paradigms, we used it again in an attempt to quantify the standardization of English spelling, where it likewise proved to be a relatively intuitive measure, as opposed to a mere form : type ratio.
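As a minimal illustration of why entropy can be more telling than a simple count of forms, the sketch below computes Shannon entropy over the spelling variants of a single word. The spellings and their frequencies are invented for the example, and the function names are hypothetical; this is not the procedure used in the studies themselves.

```python
from collections import Counter
from math import log2

def shannon_entropy(forms):
    """Shannon entropy (in bits) of the frequency distribution of forms."""
    counts = Counter(forms)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def distinct_forms(forms):
    """Number of distinct forms attested for one type (frequency is ignored)."""
    return len(set(forms))

# Invented data: spellings of one word in two hypothetical periods.
variants = ["thurgh", "thorow", "throgh", "through"]
early = variants * 2                      # four variants, equally frequent
late = ["through"] * 13 + variants[:3]    # same four variants, one dominant

for label, sample in (("early", early), ("late", late)):
    print(label, distinct_forms(sample), round(shannon_entropy(sample), 3))
```

Both samples attest four distinct spellings, so a count of forms per type cannot tell them apart, whereas the entropy of the second sample is lower because one spelling has become dominant.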

Both the richness of early English morphology and the irregularity of its spelling drew our attention to the paucity of highly annotated (i.e. lemmatized, morphologically tagged) corpus material for Old and Middle English. This gave rise to a project of automatic morphological analysis of Old English that uses pre-generated dictionaries and a rule-based grammar in a manner analogous to the morphological analysis of highly inflected languages, such as Czech, but with a special focus on formal variation. To capture the maximum lexical breadth and the fullest formal variety, Bosworth and Toller’s Anglo-Saxon Dictionary was chosen as the source of lexical data for the analyser. The analyser, as the end product of the project, is therefore based on the results of an ongoing digitization project, the Online Anglo-Saxon Dictionary (www.bosworthtoller.com), which both supplies the data to, and profits from, the morphological analysis, allowing it to connect with multiple external resources.

Our next project addressed the issue of lexical obsolescence and loss in historical English. It was first inspired by an interest in the interface between typological changes in word-formation and lexical mortality on the Old/Middle English threshold, a topic which, due to the nature of the preserved data, virtually had to avoid corpus-based methodology. For that reason, we extended the focus of our interest in lexical obsolescence and loss to include the Late Modern English period as well, where the much larger amount of available data allowed for a more robust quantitative analysis that went hand in hand with further refinement of the method.

Apart from the issue of quantifying language change, we also focused on visualizing its progress throughout the linguistic community. One such study was carried out on the Parsed Corpus of Early English Correspondence (PCEEC), a letter corpus richly annotated with sociolinguistic metadata. By the nature of the epistolary material it contains, the corpus is essentially a graph, or a network, and we employed network analysis tools such as Gephi to map lexical innovation and its spread through the language community.
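The idea of treating a letter corpus as a network can be sketched as follows, with writers as nodes and letters as directed edges. The records, the word and the heuristic for a "possible transmission" are all invented for illustration; they do not reproduce the PCEEC annotation or the analysis carried out in Gephi.

```python
# Hypothetical records: (sender, recipient, year, words used in the letter).
letters = [
    ("A", "B", 1600, {"novelword", "letter"}),
    ("A", "C", 1602, {"novelword"}),
    ("B", "C", 1605, {"novelword"}),
    ("C", "A", 1610, {"letter"}),
]

def first_uses(letters, word):
    """Map each writer to the year of their earliest letter containing word."""
    first = {}
    for sender, _recipient, year, words in letters:
        if word in words and year < first.get(sender, float("inf")):
            first[sender] = year
    return first

def possible_transmissions(letters, word):
    """Edges (sender, recipient) where the recipient received the word
    in a letter before first using it in their own writing."""
    first = first_uses(letters, word)
    edges = set()
    for sender, recipient, year, words in letters:
        if word in words and recipient in first and year < first[recipient]:
            edges.add((sender, recipient))
    return edges

print(first_uses(letters, "novelword"))
print(possible_transmissions(letters, "novelword"))
```

An edge set of this kind could then be loaded into a network analysis tool for layout and visualization; the heuristic only flags candidate paths of spread and says nothing about actual influence.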

The topics to be presented, though varied and situated at different levels of linguistic description (spelling, morphology and lexis), all display important facets of systemic interconnection. As such, they are different aspects of the same endeavour that has characterised the Praguian approach over the past hundred years: keen attentiveness to language structure combined with a sustained search for methodological advancement.

