An Introduction to the TEI and the TEI Consortium

Edward Vanhoutte


For the benefit of the readers and authors of this special issue on electronic scholarly editing, this short article introduces the Text Encoding Initiative (TEI) and the TEI Consortium (TEI-C), which are mentioned and referred to throughout this issue, and provides suggestions for further reading.

1. Representation of data

Since in 1949 Father Roberto Busa S.J. began his work on the Index Thomisticus, a lemmatized concordance to the works of Thomas Aquinas, the first volume of which only appeared in 1973 (Busa, [1965]; 1976, 1997), numerous theories, tools, and techniques have been developed that concentrate respectively on digitization, imaging, and OCR for capturing and inputting data; concordance making, sorting, text retrieval, lemmatizing, collation, statistical analysis, stylometry, and stemma (re)construction for analysing text; and typesetting, visualization, hypertext, and distribution/interchange for the publication of electronic texts. An extensive treatment of the functionality and history of each and all of these is beyond the scope of this short article—a history which can be reconstructed from Susan Hockey's latest book Electronic Texts in the Humanities (2000), which gives an overview of the methodologies and principles that govern humanities computing, and the tools and techniques used. In what follows I will concentrate on the logical culmination of all three of these evolutions (input, analysis, and output) in the creation of a uniform system for text encoding and interchange.

Since the earliest uses of computers and computing techniques in the humanities, projects have had to look for systems that could provide representations of data which the computer could process. This is necessary because, as Michael Sperberg-McQueen formulated crisply in a seminal 1991 article:

Computers can contain and operate on patterns of electronic charges, but they cannot contain numbers, which are abstract mathematical objects not electronic charges, nor texts, which are complex, abstract cultural and linguistic objects. (p. 34)

Until the 1980s, every scholar, project, software package, or research group more or less devised its own system of representation which was either structural or typographic in its approach.[1] This of course resulted in a series of proprietary encoding schemes, the functionality of which most often could not transcend the boundaries of the project or tool it had been created for. The call for reusability, interchange, system- and software-independence, portability, and collaboration in the humanities was answered by the advent of the Standard Generalized Markup Language (SGML) which became an ISO standard in 1986 (ISO 8879:1986) (Goldfarb, 1990). SGML is not in itself a markup scheme, but a methodology that enables the creation of such schemes. Based on IBM's Document Composition Facility Generalized Markup Language, SGML was developed mainly by Charles Goldfarb to become a metalanguage for the description of markup schemes that satisfied at least seven requirements for an encoding standard:[2]

  1. The requirement of comprehensiveness;
  2. The requirement of simplicity;
  3. The requirement that documents be processable by software of moderate complexity;
  4. The requirement that the standard not be dependent on any particular character set or text-entry device;
  5. The requirement that the standard not be geared to any particular analytic program or printing system;
  6. The requirement that the standard should describe text in editable form; and
  7. The requirement that the standard allow the interchange of encoded texts across communication networks.

Such a markup scheme was exactly what the humanities were looking for in their quest for an encoding standard for the preparation and interchange of electronic texts for scholarly research.

2. From P1 to P4

Shortly after the SGML standard was published, a diverse group of 32 humanities computing scholars met at Vassar College in Poughkeepsie, New York (11-12 November, 1987) and agreed on a set of methodological principles (the so-called Poughkeepsie Principles) which formed the basis for the Text Encoding Initiative (TEI).[3] The TEI soon came to adopt SGML as its basis because it was believed that SGML offered a better foundation for research-oriented text encoding than other schemes.[4] The first public proposal for the TEI Guidelines was published in 1990 as P1 (for Proposal 1). The further development of the TEI Guidelines was done in four Working Committees and 17 Working Groups, one of which was the working group on Text Criticism (TR2) chaired by Peter Robinson,[5] and another the working group on Manuscripts and Codicology (TR9) chaired by Claus Huitfeldt.[6] The TEI community took these guidelines through several rounds of testing, debate, and revision, to the monumental P3 Guidelines for Electronic Text Encoding and Interchange, first published in 1994 as a 1,292-page documentation of the definitive guidelines. The Guidelines define some 600 elements which can be used for the encoding of

texts in any natural language, of any date, in any literary genre or text type, without restriction on form or content. They treat both continuous materials ('running text') and discontinuous materials such as dictionaries and linguistic corpora (Chapter 1 of the TEI Guidelines).

The partners in the work of the TEI were The Association for Computers and the Humanities (ACH), The Association for Computational Linguistics (ACL), and The Association for Literary and Linguistic Computing (ALLC). The US National Endowment for the Humanities, the European Union, the Canadian Social Science Research Council, the Andrew W. Mellon Foundation, and others have provided funding over the years.

After the eXtensible Markup Language (XML) 1.0 was published as a W3C Recommendation in 1998, followed by a second edition in 2000 (Bray et al., 2000), the P4[7] revision of the TEI Guidelines was announced in mid-2001 by the newly formed TEI Consortium in order to provide equal support for XML and SGML applications using the TEI scheme. The conclusions and the work of the TEI community are formulated as guidelines, rules, and recommendations rather than standards, because it is acknowledged that each scholar must have the freedom to express his or her own theory of the text by encoding the features he or she considers important. The TEI Guidelines demonstrate a wide array of possible solutions to encoding problems and should therefore be considered a reference manual rather than a tutorial. Mastering the complete TEI encoding scheme implies a steep learning curve, but few projects require complete knowledge of the TEI. Therefore, a manageable subset of the full TEI encoding scheme was published as TEI Lite, describing 131 elements. Originally intended as an introduction and a stepping stone to the full recommendations, TEI Lite has, since its publication in 1995, become the most popular TEI DTD (Document Type Definition) and proves to meet the needs of 90% of the TEI community, 90% of the time, with about 20% of the elements (Burnard & Sperberg-McQueen, 1995).
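To give a concrete impression of what a TEI Lite document looks like, a minimal example is sketched below. The element names (TEI.2, teiHeader, fileDesc, and so on) are those of the TEI P4/TEI Lite DTD; the content of the header and body is invented here purely for illustration.

```xml
<!-- A minimal, illustrative TEI Lite document (P4-style; content invented) -->
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A sample text</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished example, for illustration only.</p>
      </publicationStmt>
      <sourceDesc>
        <p>No source: born-digital sample.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The text itself, marked up with TEI elements.</p>
    </body>
  </text>
</TEI.2>
```

Even this skeleton shows the characteristic TEI division between metadata (the teiHeader) and the encoded text proper.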

3. P5

As the TEI-C website points out, the revisions made for TEI P4 were deliberately restricted to error correction only, with a view to ensuring that documents conforming to TEI P3 would not become illegal when processed with TEI P4. During this process, however, many possibilities for other, more fundamental, changes were identified. With the establishment of the new TEI Council, which superintends the technical work of the TEI-C, it has become possible to agree on a programme of work to enhance and modify the Guidelines more fundamentally over the coming years. TEI P5 will be the next full revision of the Guidelines, using schemas instead of DTDs, and backwards compatibility with P4 is not guaranteed. The TEI Consortium will, however, maintain and error-correct the P4 Guidelines, so future users will have to choose between P4 and P5.

P5 will involve at least two major changes.

Recently, the TEI Consortium asked its membership to convene Special Interest Groups (SIGs) whose aim would be to advise on the revision of certain chapters of the Guidelines and to suggest changes and improvements in view of P5. The SIG on manuscript transcription first met at the TEI Members' Meeting in Nancy in November 2003, where the need for a total revision of Chapters 18 (Transcription of Primary Sources) and 19 (Critical Apparatus) was voiced.[8] These revisions are expected to include ways to encode, for example, genetic transcriptions and codicological descriptions, and to approach transcription from the perspectives of both the text and the manuscript.
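The critical apparatus mechanism of Chapter 19 can be illustrated with a small fragment. The app, lem, and rdg elements and the wit attribute are genuine TEI elements for recording variant readings; the witness sigla (A, B) and the readings themselves are invented here for the sake of the example.

```xml
<!-- Illustrative encoding of a textual variant with the TEI apparatus
     elements: one lemma and one competing reading, each tied to a witness -->
<p>The quick
  <app>
    <lem wit="A">brown</lem>
    <rdg wit="B">red</rdg>
  </app>
fox.</p>
```

An encoding like this keeps all attested readings in a single document, so that software can generate the text of any one witness, or a full apparatus, from the same source.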

4. The TEI Markup Principles

From the Poughkeepsie Principles the TEI concluded that the TEI Guidelines should:

  1. Suffice to represent the textual features needed for research;
  2. Be simple, clear, and concrete;
  3. Be easy for researchers to use without special-purpose software;
  4. Allow the rigorous definition and efficient processing of texts;
  5. Provide for user-defined extensions; and
  6. Conform to existing and emergent standards.

As mentioned above, SGML and XML (strictly speaking, an application profile of SGML) are not markup languages themselves, but metalanguages by which one can create separate markup languages for separate purposes. This means that SGML and XML define the rules and procedures to specify the vocabulary and the syntax of a markup language in a formal DTD. Such a DTD is a formal description of, for example, names for all elements, names and default values for their attributes, rules about how elements can nest, and names for reusable pieces of data (entities). The TEI created not just one DTD, but a collection of tag sets (also known as element sets or DTD fragments) which combine into one or more DTDs. Some of these tag sets are required, some are basic, and some are optional. Users of the TEI can select the appropriate tag sets for inclusion in their DTD(s). The selection and generation of TEI-conformant DTDs is facilitated by the TEI Pizza Chef,[9] an on-line tool which allows anyone to bake his or her own personalized view of the TEI document type definition in SGML or XML.
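The DTD declarations mentioned above can be illustrated with a toy fragment. This is not actual TEI content: the element names (poem, stanza, line) are invented here solely to show the four kinds of declaration a DTD is built from.

```xml
<!-- Toy DTD fragment (invented vocabulary, not TEI) showing the four
     building blocks: element declarations with content models, an
     attribute-list declaration, and an entity declaration -->
<!ELEMENT poem   (title?, stanza+)>   <!-- optional title, one or more stanzas -->
<!ELEMENT title  (#PCDATA)>           <!-- character data only -->
<!ELEMENT stanza (line+)>             <!-- nesting rule: stanzas contain lines -->
<!ELEMENT line   (#PCDATA)>
<!ATTLIST line n CDATA #IMPLIED>      <!-- optional line-number attribute -->
<!ENTITY  eacute "&#233;">            <!-- reusable named piece of data -->
```

A validating parser uses such declarations to reject any document whose elements, attributes, or nesting depart from the rules, which is what makes DTD-based interchange reliable.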

The DTD fragments from which a TEI-conformant DTD is constructed may be classified as follows:

  1. The core tag sets, which are required and define elements common to all TEI documents;
  2. The base tag sets, of which one is normally selected per document according to its broad type (e.g. prose, verse, drama, dictionaries); and
  3. The additional tag sets, which are optional and may be combined freely with any base (e.g. linking, figures, critical apparatus).

5. The TEI Consortium

The TEI Consortium was established in 2000 as a not-for-profit membership organisation to sustain and develop the TEI. The Consortium has its executive office in Bergen, Norway, and host institutions at the University of Bergen, Brown University, Oxford University, and the University of Virginia, with Lou Burnard (of Oxford) as European Editor and Syd Bauman (of Brown) as North American Editor. The Consortium is managed by a Board of Directors, and its technical work is overseen by an elected Council.

The TEI charter outlines the consortium's goals and fundamental principles. Its goals are:

  1. To establish and maintain a home for the TEI in the form of a permanent organizational structure.
  2. To ensure the continued funding of TEI-C activities, for example: editorial maintenance and development of the TEI guidelines and DTD, training and outreach activities, and services to members.
  3. To create and maintain a governance structure for the TEI-C with broad representation of TEI user-communities.

Its four fundamental principles are:

  1. The TEI guidelines, other documentation, and DTD should be free to users;
  2. Participation in TEI-C activities should be open (even to non-members) at all levels;
  3. The TEI-C should be internationally and interdisciplinarily representative;
  4. No role with respect to the TEI-C should be without term.

Involvement in the Consortium is possible in three categories: voting membership, which is open to individuals, institutions, or projects; non-voting subscription, which is open to individuals only; and sponsorship, which is open to individual or corporate sponsors.[12] Only members have the right to vote on Consortium issues and in elections to the Board and the Council; they also have access to a restricted website with pre-release drafts of Consortium working documents and technical reports, announcements and news, and a database of members, sponsors, and subscribers with contact information; and they benefit from discounts on training, consulting, and certification. The Consortium members meet annually at a Members' Meeting, where current critical issues in text encoding are discussed and members of the Council and of the Board of Directors are elected. The membership fee varies with the kind of project or institution and with its location, according to where the economy of the member's country falls in the four-part classification of Low, Lower-Middle, Upper-Middle, and High Income Economies currently listed on the World Bank's website.




© 2004 ALLC and Edward Vanhoutte

This text is published as Edward Vanhoutte "An Introduction to the TEI and the TEI Consortium." in: Mats Dahlström, Espen S. Ore, & Edward Vanhoutte (eds.), Electronic Scholarly Editing—Some Northern European Approaches. A Special Issue of Literary and Linguistic Computing, 19(1) (2004): 9-16.
