An Introduction to the TEI and the TEI Consortium
Edward Vanhoutte
edward.vanhoutte@kantl.be
Abstract
For the benefit of the readers and authors of this special issue on electronic scholarly editing, this short article introduces the Text Encoding Initiative (TEI) and the TEI Consortium (TEI-C), which are mentioned and referred to throughout this issue, and provides suggestions for further reading.
1. Representation of data
Since Father Roberto Busa S.J. began his work in 1949 on the Index Thomisticus, a lemmatized concordance to the works of Thomas Aquinas, the first volume of which only appeared in 1973 (Busa, [1965]; 1976; 1997), numerous theories, tools, and techniques have been developed that concentrate respectively on digitization, imaging, and OCR for capturing and inputting data; on concordance making, sorting, text retrieval, lemmatizing, collation, statistical analysis, stylometry, and stemma (re)construction for analysing text; and on typesetting, visualization, hypertext, and distribution/interchange for the publication of electronic texts. An extensive treatment of the functionality and history of each of these is beyond the scope of this short article; that history can be reconstructed from Susan Hockey's latest book Electronic Texts in the Humanities (2000), which gives an overview of the methodologies and principles that govern humanities computing, and of the tools and techniques used. In what follows I will concentrate on the logical culmination of all three of these evolutions (input, analysis, and output) in the creation of a uniform system for text encoding and interchange.
From the earliest uses of computers and computing techniques in the humanities, projects have had to look for systems that could provide representations of data which the computer could process, since, as Michael Sperberg-McQueen formulated it crisply in a seminal article in 1991:
Computers can contain and operate on patterns of electronic charges, but they cannot contain numbers, which are abstract mathematical objects not electronic charges, nor texts, which are complex, abstract cultural and linguistic objects. (p. 34)
Until the 1980s, every scholar, project, software package, or research group more or less devised its own system of representation which was either structural or typographic in its approach.[1] This of course resulted in a series of proprietary encoding schemes, the functionality of which most often could not transcend the boundaries of the project or tool it had been created for. The call for reusability, interchange, system- and software-independence, portability, and collaboration in the humanities was answered by the advent of the Standard Generalized Markup Language (SGML) which became an ISO standard in 1986 (ISO 8879:1986) (Goldfarb, 1990). SGML is not in itself a markup scheme, but a methodology that enables the creation of such schemes. Based on IBM's Document Composition Facility Generalized Markup Language, SGML was developed mainly by Charles Goldfarb to become a metalanguage for the description of markup schemes that satisfied at least seven requirements for an encoding standard:[2]
- The requirement of comprehensiveness;
- The requirement of simplicity;
- The requirement that documents be processable by software of moderate complexity;
- The requirement that the standard not be dependent on any particular character set or text-entry device;
- The requirement that the standard not be geared to any particular analytic program or printing system;
- The requirement that the standard should describe text in editable form; and
- The requirement that the standard allow the interchange of encoded texts across communication networks.
Such a markup scheme was exactly what the humanities were looking for in their quest for an encoding standard for the preparation and interchange of electronic texts for scholarly research.
2. From P1 to P4
Shortly after the SGML standard was published, a diverse group of 32 humanities computing scholars met at Vassar College in Poughkeepsie, New York (11-12 November, 1987) and agreed on a set of methodological principles—the so-called Poughkeepsie Principles—which formed the basis for the Text Encoding Initiative (TEI).[3] The TEI soon came to adopt SGML as its basis because it was believed that SGML offered a better foundation for research-oriented text encoding than other schemes.[4] The first public proposal for the TEI Guidelines was published in 1990 as P1 (for Proposal 1). The further development of the TEI Guidelines was done in four Working Committees and 17 Working Groups, one of which was the working group on Text Criticism (TR2) chaired by Peter Robinson,[5] and another the working group on Manuscripts and Codicology (TR9) chaired by Claus Huitfeldt.[6] The TEI community took these guidelines through several rounds of testing, debates, and revisions, to the monumental P3 Guidelines for Electronic Text Encoding and Interchange, first published in 1994 as a 1,292-page documentation of the definitive guidelines. The Guidelines define some 600 elements which can be used for the encoding of
texts in any natural language, of any date, in any literary genre or text type, without restriction on form or content. They treat both continuous materials ('running text') and discontinuous materials such as dictionaries and linguistic corpora (Chapter 1 of the TEI Guidelines).
The partners in the work of the TEI were The Association for Computers and the Humanities (ACH), The Association for Computational Linguistics (ACL), and The Association for Literary and Linguistic Computing (ALLC). The US National Endowment for the Humanities, the European Union, the Canadian Social Science Research Council, the Andrew W. Mellon Foundation, and others have provided funding over the years.
After the eXtensible Markup Language (XML) 1.0 was published as a W3C Recommendation in 1998, followed by a second edition in 2000 (Bray et al., 2000), the P4[7] revision of the TEI Guidelines was announced in mid-2001 by the newly formed TEI Consortium in order to provide equal support for XML and SGML applications using the TEI scheme. The conclusions and the work of the TEI community are formulated as guidelines, rules, and recommendations rather than standards, because it is acknowledged that each scholar must have the freedom to express his or her own theory of the text by encoding the features he or she thinks important in the text. A wide array of possible solutions to encoding matters is demonstrated in the TEI Guidelines, which should therefore be considered a reference manual rather than a tutorial. Mastering the complete TEI encoding scheme implies a steep learning curve, but few projects require a complete knowledge of the TEI. Therefore, a manageable subset of the full TEI encoding scheme was published as TEI Lite, describing 131 elements. Originally intended as an introduction and a stepping stone to the full recommendations, TEI Lite has, since its publication in 1995, become the most popular TEI DTD (Document Type Definition) and proves to meet the needs of 90% of the TEI community, 90% of the time, with about 20% of the elements (Burnard & Sperberg-McQueen, 1995).
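By way of illustration, a valid TEI document pairs a teiHeader (bibliographic and encoding metadata) with the encoded text itself. The following skeleton is a sketch of a minimal P4/TEI Lite XML document with invented content, not an example taken from the Guidelines:

```xml
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A sample electronic text</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished demonstration file.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born-digital example; no pre-existing source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter">
        <head>A chapter heading</head>
        <p>A first paragraph of running prose.</p>
      </div>
    </body>
  </text>
</TEI.2>
```

Even this minimal skeleton shows the principle at work: the markup records the structural features an encoder deems important (chapter, heading, paragraph), not the typography of any particular printing.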
3. P5
As the TEI-C website points out, the revisions needed to make TEI P4 were deliberately restricted to error correction only, with a view to ensuring that documents conforming to TEI P3 will not become illegal when processed with TEI P4. During this process, however, many possibilities for other, more fundamental, changes were identified. With the establishment of the new TEI Council, which superintends the technical work of the TEI-C, it becomes possible to agree on a programme of work to enhance and modify the Guidelines more fundamentally over the coming years. TEI P5 will be the next full revision of the Guidelines, using schemas instead of DTDs, and it is not guaranteed that P5 will be backwards compatible. The TEI Consortium will, however, maintain and error-correct the P4 Guidelines, so future users will have to choose between P4 and P5.
P5 will include at least two major changes:
- The root elements will be <TEI> and <TEIcorpus> instead of <TEI.2> and <teiCorpus.2>
- The elements will all be in the namespace http://www.tei-c.org/P5/
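In practice this means that where a P4 XML document opens with a versioned root element in no namespace, a P5 document as proposed would open with an unversioned root element bound to a namespace. The following sketch contrasts the two, using the namespace URI announced at the time of writing:

```xml
<!-- TEI P4 (XML): versioned root element, no namespace -->
<TEI.2>
  ...
</TEI.2>

<!-- TEI P5 as proposed: unversioned root element in the TEI namespace -->
<TEI xmlns="http://www.tei-c.org/P5/">
  ...
</TEI>
```

The namespace declaration is what makes TEI elements unambiguously identifiable when they are mixed with elements from other XML vocabularies.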
Recently, the TEI Consortium asked its membership to convene Special Interest Groups (SIGs) whose aim could be to advise on the revision of certain chapters of the Guidelines and to suggest changes and improvements in view of P5. The SIG on manuscript transcription first met at the TEI Members' Meeting in Nancy in November 2003, where the need for a total revision of Chapters 18 (Transcription of Primary Sources) and 19 (Critical Apparatus) was voiced.[8] It is expected that these revisions will include ways to encode, e.g., genetic transcriptions and codicological descriptions, and will approach transcription from the perspectives of both the text and the manuscript.
4. The TEI Markup Principles
From the Poughkeepsie Principles the TEI concluded that the TEI Guidelines should:
- Provide a standard format for data interchange;
- Provide guidance for encoding of texts in this format;
- Support the encoding of all kinds of features of all kinds of texts studied by researchers;
- Be application independent.
As mentioned above, SGML and XML (strictly speaking a subset of SGML) are not markup languages themselves, but metalanguages by which one can create separate markup languages for separate purposes. This means that SGML and XML define the rules and procedures for specifying the vocabulary and the syntax of a markup language in a formal DTD. Such a DTD is a formal description of, e.g., the names of all elements, the names and default values of their attributes, rules about how elements can nest, and names for re-usable pieces of data (entities). The TEI created not just one DTD, but a collection of tag sets (also known as element sets or DTD fragments) which combine into one or more DTDs. Some of these tagsets are required, some are basic, and some are optional. Users of the TEI can select the tagsets they need for inclusion in their DTD(s). The selection and generation of TEI-compliant DTDs is facilitated by the TEI Pizza Chef,[9] an on-line tool which allows anyone to bake his or her own personalized view of the TEI document type definition in SGML or XML.
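To make this concrete, the following schematic DTD fragment, in the spirit of the anthology example used in the Guidelines' own 'Gentle Introduction' chapter but invented and simplified here, declares elements and the rules about how they nest, an attribute list with a default value, and a re-usable entity:

```xml
<!-- a re-usable piece of data (entity) -->
<!ENTITY TEI "Text Encoding Initiative">

<!-- element names and the rules about how elements nest:
     an anthology contains one or more poems; a poem an
     optional title and one or more stanzas of lines -->
<!ELEMENT anthology (poem+)>
<!ELEMENT poem      (title?, stanza+)>
<!ELEMENT title     (#PCDATA)>
<!ELEMENT stanza    (line+)>
<!ELEMENT line      (#PCDATA)>

<!-- attribute names and default values -->
<!ATTLIST poem
          id     ID             #IMPLIED
          status (draft|final)  "draft">
```

A validating parser uses such declarations to reject, for instance, a line occurring directly inside an anthology, which is how a DTD enforces a shared document model across projects and software.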
The DTD fragments from which a TEI compliant DTD is constructed may be classified as follows:
- Core Tag Sets: are always required and contain the teiHeader DTD[10] and elements available in all TEI documents.[11]
- Base Tag Sets: define the basic building blocks of different text types. The following selections are available:
- Prose: this tagset is suitable for most documents most of the time;
- Verse: this tagset adds specialist tagging for metrical analysis, rhyme-scheme etc. to the basic verse markup already included in the core;
- Drama: this tagset adds specialist tagging for cast lists, records of first performance, etc. to the basic drama markup already included in the core;
- Speech: this tagset replaces the basic structure by one suitable for linguistic analysis of speech acts, etc.;
- Dictionaries: this tagset replaces the basic structure with one containing detailed lexicographic features;
- Terminology: this tagset replaces the basic structure with one specific to terminological databases;
- General base: this tagset allows you to combine tags from different base tagsets, with the proviso that any single text division can contain tags from only one of the base tagsets you choose from the following list: prose, verse, drama, spoken texts, dictionaries, terminology;
- Mixed base: this tagset allows you to combine tags from different base tagsets, with no restriction at all as to where tags from different base tagsets can appear. The different tagsets to combine are: prose, verse, drama, spoken texts, dictionaries, terminology.
- Additional Tag Sets: optional tagsets which may be selected as needed:
- linking: adds elements for hypertext linking, segmentation, and alignment;
- figures: adds elements for encoding tables, pictures, and formulae;
- analysis: adds elements for interpretation and simple linguistic analyses;
- fs: adds elements for feature structure analysis;
- certainty: adds elements for recording uncertainty and responsibility;
- transcr: adds elements for the transcription of primary sources (e.g. manuscripts);
- textcrit: adds elements for text-critical apparatus;
- names.dates: adds elements for the detailed tagging of names and dates;
- nets: adds elements for recording the abstract structure of mathematical graphs, networks, and trees;
- corpora: adds specialized elements to the TEI-header for use with language corpora.
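Concretely, a P4 XML document selects its tagsets by switching parameter entities on in the DTD subset of its document type declaration; this is what the Pizza Chef generates behind the scenes. A sketch combining the prose base with the linking and figures additional tagsets (the system identifier is illustrative) might look like this:

```xml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.XML     'INCLUDE'>  <!-- use the XML flavour of the scheme -->
  <!ENTITY % TEI.prose   'INCLUDE'>  <!-- base tagset -->
  <!ENTITY % TEI.linking 'INCLUDE'>  <!-- additional tagset -->
  <!ENTITY % TEI.figures 'INCLUDE'>  <!-- additional tagset -->
]>
```

Tagsets not switched on are simply excluded from the resulting DTD, so a validating parser will flag any element from an unselected tagset as illegal in that document.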
5. The TEI Consortium
The TEI Consortium was established in 2000 as a not-for-profit membership organisation to sustain and develop the TEI. The Consortium has its executive office in Bergen (Norway) and hosts at the University of Bergen, Brown University, Oxford University, and the University of Virginia, with Lou Burnard (of Oxford) as European Editor and Syd Bauman (of Brown) as North American Editor. The Consortium is managed by a Board of Directors, and its technical work is overseen by an elected Council.
The TEI charter outlines the consortium's goals and fundamental principles. Its goals are:
- To establish and maintain a home for the TEI in the form of a permanent organizational structure.
- To ensure the continued funding of TEI-C activities, for example: editorial maintenance and development of the TEI guidelines and DTD, training and outreach activities, and services to members.
- To create and maintain a governance structure for the TEI-C with broad representation of TEI user-communities.
Its four fundamental principles are:
- The TEI guidelines, other documentation, and DTD should be free to users;
- Participation in TEI-C activities should be open (even to non-members) at all levels;
- The TEI-C should be internationally and interdisciplinarily representative;
- No role with respect to the TEI-C should be without term.
Involvement in the Consortium is possible in three categories: voting membership, which is open to individuals, institutions, or projects; non-voting subscription, which is open to individuals only; and sponsorship, which is open to individual or corporate sponsors.[12] Only members have the right to vote on Consortium issues and in elections to the Board and the Council; only they have access to a restricted website with pre-release drafts of Consortium working documents and technical reports, announcements and news, and a database of members, sponsors, and subscribers with contact information; and only they benefit from discounts on training, consulting, and certification. The Consortium members meet annually at a Members' Meeting where current critical issues in text encoding are discussed and members of the Council and of the Board of Directors are elected. The membership fee payable varies with the kind of project or institution and with its location, according to where the economy of the member's country falls in the four-part listing of Low, Lower-Middle, Upper-Middle, and High Income Economies, as currently listed on the World Bank's website.
Notes
- 1. The best known early encoding systems are COCOA and the Beta-transcription/encoding system. Cf. (Hockey, 2000: 24-48; Russell, 1967; Lancashire et al., 1996; Berkowitz and Squitier, 1986). [back]
- 2. Requirements taken from (Barnard et al., 1988: 28-29). [back]
- 3. For a history of the TEI, see (Ide and Sperberg-McQueen, 1995: 6-7), (Mylonas and Renear, 1999) and both chapters 1. About These Guidelines in the TEI P3 and P4 Guidelines (Sperberg-McQueen and Burnard, 1999, 2002). The P4 Guidelines are available from http://www.tei-c.org/P4X/, the previous and superseded versions from the TEI Vault http://www.tei-c.org/Vault/Vault-GL.html. [back]
- 4. Cf. (Fraser, 1986) and (Barnard et al., 1988). [back]
- 5. Other members were: David Chesnutt, Robin Cover, Robert Kraft, and Peter Shillingsburg. [back]
- 6. Other members were: Dino Buzzetti, Jacqueline Hamesse, Mary Keeler, Christian Kloesel, Allen Renear, and Donald Spaeth. [back]
- 7. The XML compatible P4 Guidelines can be found at http://www.tei-c.org/P4X/. [back]
- 8. The SIG was convened and led by Elena Pierazzo, Susan Schreibman, and Edward Vanhoutte. [back]
- 9. http://www.tei-c.org/pizza.html. The name 'Pizza Chef' refers to the metaphor which the TEI editors used to describe the construction of the TEI DTD, and which they called the Pizza Model: when ordering a pizza, one always gets tomato sauce and cheese (core), one can choose which crust one wants (base), and order the toppings (additional). [back]
- 10. Chapter 5 of the TEI Guidelines. [back]
- 11. Chapter 6 of the TEI Guidelines. [back]
- 12. The terms and benefits of each class of participation can be found on the TEI website. [back]
References
- Barnard, David T., Cheryl A. Fraser, & George M. Logan (1988). 'Generalized Markup for Literary Texts.' In: Literary and Linguistic Computing, 3(1): 26-31.
- Berkowitz, L. and Squitier, K. A. (1986). Thesaurus Linguae Graecae, Canon of Greek Authors and Works. New York/Oxford: Oxford University Press.
- Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, & Eve Maler (eds.) (2000). Extensible Markup Language (XML) 1.0 (Second Edition). http://www.w3.org/TR/REC-xml
- Burnard, Lou and C.M. Sperberg-McQueen (1995). TEI Lite: An Introduction to Text Encoding for Interchange (TEI U5). http://www.tei-c.org/Lite/index.html
- Busa, Roberto ([1965]). 'An Inventory of Fifteen Million Words.' In: Literary Data Processing Conference Proceedings. J.B. Bessinger and S.M. Parrish (eds.). White Plains. pp. 64-78.
- Busa, Roberto (1976). 'Computer Processing of over Ten Million Words: Retrospective Criticism.' In: The Computer in Literary and Linguistic Studies. (Proceedings of the Third International Symposium). Alan Jones and R.F. Churchhouse (eds.). Cardiff: University of Wales Press. pp. 114-117.
- Busa, Roberto (1997). 'Concluding a Life's Safari from Punched Cards to World Wide Web.' In: The Digital Demotic: Selected Papers from DRH97, Digital Resources for the Humanities Conference, St. Anne's College, Oxford, September 1997. L. Burnard, M. Deegan, & H. Short (eds.) London: Office for Humanities Communication Publications 10. pp. 3-11.
- Fraser, Cheryl A. (1986). An Encoding Standard for Literary Documents. M.S. thesis, Queen's University, Ontario.
- Goldfarb, C.E. (1990). The SGML Handbook. Oxford: Clarendon.
- Hockey, Susan (2000). Electronic Texts in the Humanities. Principles and Practice. Oxford: Oxford University Press.
- Ide, Nancy M. and C.M. Sperberg-McQueen (1995). 'The TEI: History, Goals, and Future.' In: Computers and the Humanities, 29: 5-15.
- Lancashire, I., J. Bradley, W. McCarty, M. Stairs, and T. R. Wooldridge (1996). Using TACT with Electronic Texts. New York: Modern Language Association of America.
- Mylonas, Elli and Allen Renear (1999). 'The Text Encoding Initiative at 10: Not Just an Interchange Format Anymore — But a New Research Community'. In: Computers and the Humanities, 33: 1-9.
- Russell, D.B. (1967). COCOA: A Word Count and Concordance Generator for Atlas. Chilton, UK.
- Sperberg-McQueen, C.M. (1991). 'Text in the Electronic Age: Textual Study and Text Encoding with examples from Medieval Texts.' In: Literary and Linguistic Computing, 6(1): 34-46.
- Sperberg-McQueen, C.M., and L. Burnard (eds.) (1999). Guidelines for Electronic Text Encoding and Interchange. TEI P3. Revised reprint. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative.
- Sperberg-McQueen, C.M. and L. Burnard (eds.) (2002). TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML Version. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium.
URLs
- Association for Computers and the Humanities (ACH): http://www.ach.org
- Association for Computational Linguistics (ACL): http://www1.cs.columbia.edu/~acl/home.html
- Association for Literary and Linguistic Computing (ALLC): http://www.allc.org
- TEI Consortium (TEI): http://www.tei-c.org
- World Wide Web Consortium (W3C): http://www.w3.org
© 2004 ALLC and Edward Vanhoutte
This text is published as Edward Vanhoutte "An Introduction to the TEI and the TEI Consortium." in: Mats Dahlström, Espen S. Ore, & Edward Vanhoutte (eds.), Electronic Scholarly Editing—Some Northern European Approaches. A Special Issue of Literary and Linguistic Computing, 19(1) (2004): 9-16.