Texts and Transcriptions: mapping scribal complexities onto a line of text.

Edward Vanhoutte

The transcription of modern manuscript material is the core activity of a couple of newly initiated electronic editing projects at the Centre for Scholarly Editing and Document Studies (Centrum voor Teksteditie en Bronnenstudie) of the Royal Academy of Dutch Language and Literature in Ghent, Belgium (Koninklijke Academie voor Nederlandse Taal- en Letterkunde). For these projects to result in the publication of versioning editions representing multiple texts (Reiman 1987) - in facsimile as well as in machine-readable form, in concordances, stemmata, lists of variants, etc. - transcriptions of all the extant material have to be made in a platform-independent and non-proprietary markup language which can deal with both the linguistic and the bibliographic text of a work and which can guarantee maximal accessibility, longevity and intellectual integrity (Sperberg-McQueen, 1994 & 1996: 41). The encoding schemes proposed by the TEI Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen & Burnard 1994) have generally been accepted as the most promising solution to this problem. The transcription of primary source material using TEI enables automatic collation, stemma (re)construction, and the creation of (cumulative) indexes, concordances, etc. by the computer.

Although the TEI subsets for the transcription of primary source material "have not proved entirely satisfactorily" for a number of problems (Driscoll 2000), transcription and digitization guidelines for older texts can be produced on the basis of the TEI encoding scheme (Robinson 1994, and Robinson & Solopova 1993). The transcription of modern manuscript material using TEI proves to be more problematic because of at least two essential characteristics of such complex source material: the notions of time and of overlapping hierarchies. Since SGML (and thus XML) was devised on the assumption that a document is a logical construct that contains one or more trees of elements that make up the document's content (Goldfarb 1995: 18), several scholars began to theorize about the assumption that text is an ordered hierarchy of content objects (OHCO thesis), which always nest properly and never overlap,[1] and about the difficulties attached to this claim.[2]

The TEI Guidelines propose five possible methods to handle non-nesting information,[3] but state that

"Non-nesting information poses fundamental problems for any encoding scheme, and it must be stated at the outset that no solution has yet been suggested which combines all the desirable attributes of formal simplicity, capacity to represent all occurring or imaginable kinds of structures, suitability for formal or mechanical validation, and clear identity with the notations needed for simpler cases (i.e. cases where the textual features do nest properly). The representation of non-hierarchical information is thus necessarily a matter of choices among alternatives, of tradeoffs between various sets of different advantages and disadvantages." (chapter 31 "Multiple Hierarchies" of the TEI Guidelines)

The editor using an encoding scheme for the transmission of any feature of a modern manuscript text to a machine-readable format is essentially confronted with the dynamic concept of time, which constitutes non-hierarchical information. Whereas the simple representation of a (printed) prose text can be thought of as a logical tree of hierarchical and structural elements such as book, part, chapter, and paragraph, and an alternative tree of hierarchical and physical elements such as volume, page, column, and line - structures which can be applied to the vast majority of printed texts and medieval manuscripts - the modern manuscript shows a much more complicated web of interwoven and overlapping relationships of elements and structures.
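The conflict between the logical and the physical tree can be illustrated with a minimal sketch in Python. It is not taken from any of the projects described here: the sample strings, the element name page and the helper function names are assumptions made for the illustration, and the empty pb element stands for the milestone approach, which is only one of the methods the TEI Guidelines discuss for non-nesting information. The sketch shows that a paragraph running across a page boundary cannot be encoded with both hierarchies as containers, whereas reducing the page boundary to an empty milestone leaves a well-formed tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a paragraph that runs across a page boundary.
# The logical unit <p> and the physical unit <page> overlap, so this
# string is not well-formed XML.
overlapping = (
    "<text>"
    "<page n='1'><p>The sentence starts here </page>"
    "<page n='2'>and ends here.</p></page>"
    "</text>"
)

# The same content with the page boundaries reduced to empty milestone
# elements (<pb/>), so that only the logical hierarchy forms a tree.
milestones = (
    "<text>"
    "<pb n='1'/><p>The sentence starts here "
    "<pb n='2'/>and ends here.</p>"
    "</text>"
)


def is_well_formed(xml_string):
    """Return True if the string parses as a single XML tree."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


def page_breaks_inside_paragraph(xml_string):
    """List the n values of page breaks that fall inside the first <p>."""
    root = ET.fromstring(xml_string)
    paragraph = root.find("p")
    return [pb.get("n") for pb in paragraph.iter("pb")]


print(is_well_formed(overlapping))
print(is_well_formed(milestones))
print(page_breaks_inside_paragraph(milestones))
```

The trade-off is visible in the second string: the page is no longer an element with content of its own, so anything an editor wants to say about a page as a whole has to be reconstructed from the positions of the milestones.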

Modern manuscripts, as Almuth Grésillon defines them, are "manuscrits qui font partie d'une genèse textuelle attestée par plusieurs témoins successifs et qui manifestent le travail d'écriture d'un auteur." ["manuscripts which are part of a textual genesis for which many consecutive witnesses give evidence, and which are the manifestation of the author's labour of writing." (my translation).] (Grésillon 1994: 244) The French school of Critique Génétique primarily deals with modern manuscripts, and its primary aim is to study the avant-texte, not so much as the basis for setting out editorial principles for textual representation, but as a means to understand the genesis of the literary work, or, as Daniel Ferrer put it: "it does not aim to reconstitute the optimal text of a work; rather, it aims to reconstitute the writing process which resulted in the work, based on surviving traces, which are primarily author's draft manuscripts" (Ferrer 1995: 143). Therefore, the structural unit of a modern manuscript is not the paragraph, nor the page or the chapter, but the temporal unit of writing. These units form a complex network which is often not bound to the chronology of the page.

The application of hypertext technology and the possibility of displaying digital facsimiles in establishing electronic dossiers génétiques provide the editor with a multiplicity of ways in which s/he can regroup a series of documents which are akin to each other on the basis of resemblance or difference. The experiments with proprietary software systems (Hypercard, Toolbook, Macromedia, PDF, etc.), however, are too much oriented towards display, and often do not comply with the rule of "no digitization without transcription" (Robinson 1997).

Further, the TEI solutions for the transcription of primary source material do not cater for modern manuscripts because the current (P4) and previous versions of the TEI have never addressed the encoding of the time factor in text. Since a writing process by definition takes place in time, four central complications may arise in connection with modern manuscripts and should thus be catered for in an encoding scheme for the transcription of modern primary source material. The complications are the following:

  1. The beginning and end of a writing process may be hard to determine and its internal composition difficult to define (document structure vs. unit of writing): authors frequently interrupt writing, leave sentences unfinished and so on.
  2. Manuscripts frequently contain items such as scriptorial pauses which have immense importance in the analysis of the genesis of a text.
  3. Even non-verbal elements such as sketches, drawings, or doodles may be regarded as forming a component of the writing process for some analytical purposes.
  4. Below the level of the chronological act of writing, manuscripts may be segmented into units defined by thematic, syntactic, stylistic, etc. phenomena; no clear agreement exists, however, even as to the appropriate names for such segments.

These four complications are exactly the ones the TEI Guidelines cite when trying to define the complexity of speech, emphasizing that "Unlike a written text, a speech event takes place in time." (Sperberg-McQueen and Burnard 2001: 254). This may suggest that the markup solutions employed in the transcription of speech could prove useful for the transcription of modern manuscripts, in particular the chapter in the TEI Guidelines on Linking, Segmentation, and Alignment (esp. 14.5. Synchronization).
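The idea of synchronizing units of writing with a timeline, rather than reading them in page order, can be sketched in Python as follows. The sketch is an illustration only and does not reproduce TEI markup: the class WritingAct, its when pointer, the timeline list and the sample acts are all hypothetical names and data introduced for this example. It assumes that only the relative order of points in time is known, not absolute dates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritingAct:
    """A hypothetical temporal unit of writing."""
    text: str
    page: int   # physical position: page number
    line: int   # physical position: line on the page
    when: str   # pointer to a point on the timeline


# Hypothetical relative timeline: only the order of the points is known.
timeline = ["w1", "w2", "w3"]

# Invented sample data: the marginal addition stands first on page 1,
# but it was written last.
acts = [
    WritingAct("first draft of the sentence", page=1, line=3, when="w1"),
    WritingAct("marginal addition", page=1, line=1, when="w3"),
    WritingAct("continuation overleaf", page=2, line=1, when="w2"),
]


def document_order(acts):
    """Order the acts by their physical position in the document."""
    return sorted(acts, key=lambda act: (act.page, act.line))


def writing_order(acts, timeline):
    """Order the acts by the timeline point each one is synchronized with."""
    position = {point: index for index, point in enumerate(timeline)}
    return sorted(acts, key=lambda act: position[act.when])


for act in writing_order(acts, timeline):
    print(act.when, act.page, act.line, act.text)
```

Keeping the physical position and the temporal pointer as separate properties means that neither ordering has to be privileged in the transcription itself; both can be derived from the same record.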

This paper will deal with the practicalities underlying the production of electronic scholarly editions and will report on the results of a research stay at the Wittgenstein Archives in Bergen, supported by an EU Research Infrastructure grant, in June-July 2002.