
It's all in the Head(er): From minimal to optimal use of the TEI Header.

Edward Vanhoutte

edward.vanhoutte@kantl.be

document: CTB-TEC1 (revised draft)
March 2001



1. Introduction

A TEI-conformant SGML instance is typically characterized by the presence of a mandatory TEI Header which is often referred to as the electronic title page. The TEI Header meets the demands of the scholarly community for an in-file documentation of text-specific metadata. The record of this information is essential for any satisfactory interchange of texts coming from multiple sources, or for which long term uses are envisaged, and can serve multiple goals:

  1. The software which parses and processes the electronic text can at the same time make intelligent use of the metadata provided with the text.
  2. The metadata enables cataloguing and bibliographic referencing.
  3. The scholar who is confronted with the electronic text is always provided with an explicit account of the status of the electronic text.


The TEI Header is introduced by the element <teiHeader> and has four major parts, only the first of which is mandatory:

1. a file description <fileDesc>
contains a full bibliographic description of an electronic file, including information about the sources from which the electronic text was derived. Essential for bibliographic referencing and cataloguing.
2. an encoding description <encodingDesc>
documents the relationship between an electronic text and the source or sources from which it was derived. It can record detailed information about transcription and transliteration principles, such as normalization and the treatment of quotations and hyphenation, and about the levels of interpretation (i.e. the analytic tagging and encoding) applied to the document.
3. a profile description <profileDesc>
provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their settings.
4. a revision description <revisionDesc>
summarizes the revision history for a file, which is important for version control and for resolving questions about the history of a file, especially when a team of scholars is working on the same document.


The full form of a TEI Header is thus:

   <teiHeader>
      <fileDesc> ... </fileDesc>
      <encodingDesc> ... </encodingDesc>
      <profileDesc> ... </profileDesc>
      <revisionDesc> ... </revisionDesc>
   </teiHeader>

while a minimal header takes the form:

   <teiHeader>
      <fileDesc> ... </fileDesc>
   </teiHeader>




2. Proposed Structure

The strength of this system of one mandatory and three optional elements is that it caters for a wide range of applications in the Humanities. According to the needs and the specific features of the project for which texts are encoded, the encoder can more or less freely decide which set of elements (s)he will use for the in-file documentation of the encoded text. But when this freedom of user-defined metadata is applied to an enterprise such as ours (i.e. agreeing on a communal set of minimal tagging so that our texts are maximally interchangeable and (re)usable), it may turn out to be a poisoned gift. Therefore, we should agree on a solid header structure which is a trade-off between completeness and density.

In trying to agree on a minimal header structure it is important to emphasize that this is not a limiting proposal. Putting the proposal into practice only assumes that the complete structure of the proposal is present in the proposal-conformant electronic text, possibly as a basis for more. It would for instance be absurd to read this proposal as a prohibition on using <editionStmt> with its typical elements inside <fileDesc> when creating a new edition of an electronic text, e.g. by adding referential links to external datasets.

The following is a list of the possible contents of such a minimal header; as a working document it is still under discussion. (It was originally devised as the BEB-TEC2 draft and adopted as the CTB-TEC1 draft.) The elements are presented in the order in which they appear in chapter 5 of TEI P3, "The TEI Header", and chapter 20 of TEI U5, "The Electronic Title Page". The proposed structure is further explained in 3. Documentation.
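
In outline, and as a sketch assembled from the elements documented in section 3 rather than a normative listing, the structure looks as follows. Which of the listed children are required in a given header is described per element in section 3; all contents and attribute values shown as ... are placeholders.

   <teiHeader creator="..." date.created="yyyy-mm-dd">
      <fileDesc>
         <titleStmt>
            <title> ... </title>
            <author> ... </author>
            <editor> ... </editor>
            <principal> ... </principal>
            <funder> ... </funder>
            <respStmt> ... </respStmt>
         </titleStmt>
         <publicationStmt>
            <publisher> ... </publisher>
            <distributor> ... </distributor>
            <authority> ... </authority>
            <pubPlace> ... </pubPlace>
            <date> ... </date>
            <idno type="..."> ... </idno>
            <availability status="..."> ... </availability>
         </publicationStmt>
         <sourceDesc>
            <bibl> ... </bibl>
         </sourceDesc>
      </fileDesc>
      <encodingDesc>
         <projectDesc> ... </projectDesc>
         <editorialDecl> ... </editorialDecl>
         <tagsDecl> ... </tagsDecl>
      </encodingDesc>
      <revisionDesc>
         <change> ... </change>
      </revisionDesc>
   </teiHeader>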



3. Documentation

In this chapter, all the elements of the above structure are documented with their respective and proposed attributes. Extensive use has been made of the aforementioned documents TEI P3, and TEI U5.

All elements in the proposed header structure (like all elements in the TEI Lite DTD) have the following global attributes:

ana
links an element with its interpretation.
corresp
links an element with one or more corresponding elements.
id
unique identifier for the element; must begin with a letter, can contain letters, digits, hyphens, and periods.
lang
language of the text in this element; if not specified, language is assumed to be the same as in the surrounding context.
n
name or number of this element; may be any string of characters. Often used for recording traditional reference systems.
next
links an element to the next element in an aggregate.
prev
links an element to the previous element in an aggregate.
rend
physical realization of the element in the copy text: italic, roman, display block, etc. Value may be any string of characters.


The use of these global attributes is recommended when applicable but the attributes as such do not belong to the proposed header structure.
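
As an illustration only, id and n could be used to label the entries of a revision history. The identifier values rev1 and rev2 are hypothetical and the contents of the entries are elided:

   <revisionDesc>
      <change id="rev2" n="2"> ... </change>
      <change id="rev1" n="1"> ... </change>
   </revisionDesc>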


3.1 The <teiHeader> element

Each header starts with the <teiHeader> tag which carries two attributes:

creator
documents the name of the person who created the header.
date.created
specifies the date on which the header was created. Takes the form yyyy-mm-dd as specified by ISO 8601.


The header ends with the </teiHeader> tag.

Example:

   <teiHeader creator="Edward Vanhoutte" date.created="2000-01-10">
      <!-- header -->
   </teiHeader>



3.2 The File Description

The File Description is a mandatory element in TEI (Lite) and thus in the proposed header structure. It provides full bibliographic information on the electronic file. The File Description element has been closely modelled on existing standards in library cataloguing and should provide enough information to enable referencing and cataloguing.

The File Description starts with the <fileDesc> tag and in the proposed structure it consists of three parts: the title statement, the publication statement, and the source description. The File Description ends with the </fileDesc> tag.

Example:

   <fileDesc>
      <titleStmt> ... </titleStmt>
      <publicationStmt> ... </publicationStmt>
      <sourceDesc> ... </sourceDesc>
   </fileDesc>



3.2.1 The Title Statement

The Title Statement <titleStmt> contains the title given to the electronic work together with information on the parties responsible for the contents of the electronic text:

<title>
contains the chief name of the file (the title of a work, whether article, book, journal, or series), including any alternative title or subtitles. It may be repeated if the file has more than one title (perhaps in different languages) and takes whatever form is considered appropriate by its creator. Where the electronic text is derived from an existing source text, it is strongly recommended that the title of the former be derived from that of the latter, while remaining clearly distinguishable from it. For example, do not call the computer file "De teleurgang van den Waterhoek, the first print edition" but rather "De teleurgang van den Waterhoek, first print edition: a machine readable transcription" or "A machine readable version of: De teleurgang van den Waterhoek, first print edition". This will distinguish the computer file from the source text in citations and in catalogues which contain descriptions of both types of material. It is also strongly recommended not to use the file name of the electronic text as the content of the <title> element, because a file name is entirely dependent on the user and the computer system in use and thus cannot always easily be transferred from one system to another.
<author>
in a bibliographic reference, contains the name of the author(s), personal or corporate, of a work; the primary statement of responsibility for any bibliographic item.
<editor>
documents the secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution, or organization (or of several such) acting as editor, compiler, translator, etc.
<principal>
supplies the name of the principal researcher responsible for the creation of an electronic text, i.e. the markup and its contents. The name of the person responsible for the physical data input need not normally be recorded, but in small scale projects the principal researcher and the encoder will often be one and the same person.
<funder>
specifies the name of an individual, institution, or organization responsible for the funding of a project or the electronic creation of a text. The <funder> element consists of:
  • at least one <name> element which contains the name of the funding person or body.
  • optionally an <address> element subdivided into <addrLine> elements, holding the address information of the funder.
<respStmt>
supplies a statement of responsibility for someone responsible for the intellectual content of a text, edition, recording, or series, where specialized elements for authors, editors, etc., do not suffice or do not apply. The <respStmt> element consists of:
  • <resp>: contains a phrase describing the nature of a person's intellectual responsibility.
  • <name>: contains a proper noun or a noun phrase.

Examples:

   <titleStmt>
      <title>It's all in the Head(er):
      From minimal to optimal use of the TEI Header.</title>
      <author>Edward Vanhoutte</author>
      <principal>Edward Vanhoutte</principal>
      <funder>
         <name>Centrum voor Teksteditie en Bronnenstudie - CTB</name>
         <address>
            <addrLine>Koningstraat 18</addrLine>
            <addrLine>B-9000 Gent</addrLine>
            <addrLine>(België)</addrLine>
            <addrLine>tel: +32 (0)9 265 93 50 x334</addrLine>
            <addrLine>fax: +32 (0)9 265 93 49</addrLine>
            <addrLine>email: evanhoutte@kantl.be</addrLine>
         </address>
      </funder>
   </titleStmt>


   <titleStmt>
      <title>De gedichten I: a machine readable transcription</title>
      <author>Herman de Coninck</author>
      <respStmt>
         <resp>compiled by</resp>
         <name>Hugo Brems</name>
      </respStmt>
      <principal>Edward Vanhoutte</principal>
   </titleStmt>
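

A further sketch shows <editor>, which the two examples above do not use. The names are those given for the Streuvels edition in section 3.3.1; that the header of that edition is actually encoded this way is an assumption:

   <titleStmt>
      <title>De teleurgang van den Waterhoek, first print edition:
      a machine readable transcription</title>
      <author>Stijn Streuvels</author>
      <editor>Marcel De Smedt</editor>
      <editor>Edward Vanhoutte</editor>
   </titleStmt>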



3.2.2 The Publication Statement

The Publication Statement <publicationStmt> groups information concerning the publication or distribution of an electronic text. At least one of the first three elements (<publisher>, <distributor>, <authority>) must be present, followed by (one or more of) the other elements as given in the proposed structure:

<publisher>
supplies the name of the organization responsible for the publication or distribution of a bibliographic item; by whose authority a given edition of the file is made public.
<distributor>
provides the name of a person or other agency responsible for the distribution of a text; from whom copies of the text may be obtained.
<authority>
supplies the name of a person or other agency responsible for making an electronic file available, other than a publisher or distributor. The <authority> element is used when a file is not considered formally published, but is nevertheless made available for circulation by some individual or organization (the so-called release authority).
<pubPlace>
contains the name of the place where a bibliographic item was published.
<date>
contains a date in any format.
<idno>
supplies any standard or non-standard number, used to identify a bibliographic item. Optional attribute:
  • type: categorizes the number, for example as in ISBN or other standard series.
<availability>
supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, etc. The information is given inside one or more <p> elements. Attributes include:
  • status: supplies a code identifying the current availability of the text. Sample values include:
    • free: the text is freely available.
    • unknown: the status of a text is unknown.
    • restricted: the text is not freely available.


Examples:

   <publicationStmt>
      <publisher>KANTL.</publisher>
      <pubPlace>Gent</pubPlace>
      <date>2000</date>
      <distributor>Amsterdam University Press. Amsterdam, 2000</distributor>
      <idno type="ISBN">90-5356-441-1</idno>
      <availability status="RESTRICTED">
         <p>&copy; Copyright 2000, Edward Vanhoutte</p>
         <p>Niets uit deze uitgave mag door middel van
         elektronische of andere middelen, met inbegrip van automatische
         informatiesystemen, worden gereproduceerd en/of openbaar gemaakt
         zonder schriftelijke toestemming van de uitgever, uitgezonderd
         korte fragmenten, die uitsluitend voor recensies en onderwijs
         mogen worden geciteerd.</p>
      </availability>
   </publicationStmt>


   <publicationStmt>
      <authority>CTB</authority>
      <pubPlace>Gent</pubPlace>
      <date>2001</date>
      <idno type="internal">CTB-TEC1</idno>
      <availability status="RESTRICTED">
         <p>revised draft version for discussion purposes only</p>
      </availability>
   </publicationStmt>

3.2.3 The Source Description

The Source Description <sourceDesc> records bibliographic details of the source(s) from which computer files are derived or generated. This may be a printed text or a manuscript, another computer file, an audio or video recording of some kind or a combination of these. The bibliographic details are documented inside a <bibl> element:

<bibl>
contains a loosely-structured bibliographic citation.


An electronic file may also have no source, when it is created as an original electronic text. This is signalled inside a <p> element using the formula "No source: created in machine readable form."

If a machine-readable text is based not on a printed source but on another machine-readable text which includes a TEI header, the header information of the latter will have to be incorporated into the header of the former. TEI P3 chapter 5.2.8 "Computer Files Derived from Other Computer Files" explains how to do that.
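
In outline, TEI P3 asks for the file description of the source file to be copied into the <sourceDesc> of the new file, enclosed in a <biblFull> element. The sketch below only shows the shape: <biblFull> falls outside the proposed header structure, and the elided contents stand for the source file's own statements.

   <sourceDesc>
      <biblFull>
         <titleStmt> ... </titleStmt>
         <publicationStmt> ... </publicationStmt>
         <sourceDesc> ... </sourceDesc>
      </biblFull>
   </sourceDesc>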

Examples:

   <sourceDesc>
      <bibl>Stijn Streuvels, De teleurgang van den Waterhoek,
      Brugge, Excelsior, s.d. (1927). &amp; Amsterdam, L.J. Veen,
      s.d. (1927). Eerste druk.</bibl>
   </sourceDesc>


   <sourceDesc>
      <p>No source: created in machine readable form.</p>
   </sourceDesc>


3.3 The Encoding Description

The Encoding Description is the second mandatory major division in the proposed header structure. Though not formally required by the TEI Guidelines, its use is made mandatory in this proposal, because it specifies the methods and editorial principles which governed the transcription or encoding of the text in hand, and thus it provides an answer to the question about the intellectual integrity of the encoded text.

The Encoding Description starts with the <encodingDesc> tag and in the proposed structure it consists of three parts: the project description, the editorial practices declaration, and the tagging declaration. The Encoding Description ends with the </encodingDesc> tag.

Example:

   <encodingDesc>
      <projectDesc> ... </projectDesc>
      <editorialDecl> ... </editorialDecl>
      <tagsDecl> ... </tagsDecl>
   </encodingDesc>



3.3.1 The Project Description

The Project Description <projectDesc> gives, inside a <p> element, a detailed prose description of the aim or purpose for which an electronic file was encoded.

Example:

   <projectDesc>
      <p>This SGML instance was created for the
      Electronic Streuvels Project (ESP): Stijn Streuvels,
      De teleurgang van den Waterhoek. Elektronisch-kritische editie
      door Marcel De Smedt en Edward Vanhoutte. Amsterdam: AUP/KANTL, 2000.
      ISBN: 90-5356-441-1.</p>
   </projectDesc>



3.3.2 The Editorial Practices Declaration

The Editorial Practices Declaration <editorialDecl> provides information on the editorial principles applied in the transfer from the source text to the electronic file. Because this information is vital for assessing the intellectual status of a text, the documentation should be quite explicit. Therefore, the proposed structure suggests a <list> element with five <item> child elements named correction, normalization, quotation, hyphenation and interpretation:

<item>correction: ... </item>
states how and under what circumstances corrections have been made in the text. May contain a reference to a list of the corrections made.
<item>normalization: ... </item>
indicates the extent of normalization or regularization of the original source carried out in converting to electronic form.
<item>quotation: ... </item>
specifies editorial practice adopted with respect to quotation marks in the original.
<item>hyphenation: ... </item>
summarizes the way in which hyphenation in a source text has been treated in its encoded version. May contain a reference to a list of the end-of-line hyphenations in the source text.
<item>interpretation: ... </item>
describes the scope of any analytic or interpretive information added to the text in addition to the transcription.

Example:

   <editorialDecl>
      <p>All editorial principles are explained in the chapter
      <ref target="constitutie">Tekstconstitutie</ref>.
         <list>
            <item>correction: the text is thoroughly collated against
            the original and proofread several times. Mistakes in the source
            text are corrected inside a &lt;CORR&gt; tag and the
            original reading is documented inside a SIC attribute. The editor
            responsible for the correction is named in a RESP attribute with
            "MD" for Marcel De Smedt and "EV" for Edward Vanhoutte.</item>
            <item>normalization: no normalizations are carried out.</item>
            <item>quotation: all quotation marks are transcribed and
            standardized as data in the text. The source text uses low
            opening marks and high closing marks.</item>
            <item>hyphenation: end-of-line hyphenation has been removed and
            is documented in the chapter <ref target="constitutie">
            end-of-line hyphenation</ref>. All other hyphenation has been
            retained.</item>
            <item>interpretation: no analytical or interpretive encoding
            added.</item>
         </list>
      </p>
   </editorialDecl>



3.3.3 The Tagging Declaration

The Tagging Declaration <tagsDecl> records which elements are used within a particular text, how often each of them occurs, and any comment on their usage.

The <tagsDecl> element consists of a sequence of <tagUsage> elements, one for each distinct element occurring within the outermost <text> element of a TEI document:

<tagUsage>
supplies information about the usage of a specific element within a <text>. Attributes include:
  • gi: gives the generic identifier (i.e. the name) of the element being documented.
  • occurs: specifies the number of occurrences of this element within the text.


Example:

   <tagsDecl>
      <tagUsage gi="add" occurs="1">Addition</tagUsage>
      <tagUsage gi="body" occurs="1">Body of a text, excluding front
      or back matter</tagUsage>
      <tagUsage gi="corr" occurs="1">Corrected form</tagUsage>
      <tagUsage gi="del" occurs="1">Deletion</tagUsage>
      <tagUsage gi="div" occurs="75">Subdivision</tagUsage>
      <tagUsage gi="eg" occurs="73">Example</tagUsage>
      <tagUsage gi="emph" occurs="73">Used to mark Examples</tagUsage>
      <tagUsage gi="figure" occurs="29">Figure of any kind</tagUsage>
      <tagUsage gi="gi" occurs="146">Generic identifier</tagUsage>
      <tagUsage gi="head" occurs="65">Heading of subdivision or list</tagUsage>
      <tagUsage gi="hi" occurs="342">Used only to mark text italicized in the
      source text</tagUsage>
      <tagUsage gi="item" occurs="349">Component of a list</tagUsage>
      <tagUsage gi="lb" occurs="2">Linebreak</tagUsage>
      <tagUsage gi="list" occurs="70">Sequence of items</tagUsage>
      <tagUsage gi="note" occurs="26">Annotation</tagUsage>
      <tagUsage gi="p" occurs="199">Paragraph</tagUsage>
      <tagUsage gi="ref" occurs="96">Reference to another location</tagUsage>
      <tagUsage gi="seg" occurs="1">Nestable segment of any kind</tagUsage>
      <tagUsage gi="text" occurs="1">Individual text, unitary or
      composite</tagUsage>
      <tagUsage gi="xref" occurs="1">External reference</tagUsage>
   </tagsDecl>


3.4 The Revision Description

The Revision Description is the third and last mandatory major division in the proposed header structure. Though not formally required by the TEI Guidelines, its use is made mandatory in this proposal because it summarizes the revision history for a file. No change should be made in any TEI-conformant file without corresponding entries being added in a change log. The Revision Description provides the means to do so. It provides essential information for the administration of large numbers of files which are being updated, corrected, or otherwise modified. It proves to be extremely useful for files which are being exchanged between researchers or systems. Without change logs, as provided by this element, it is easy to confuse different versions of a file, or to remain unaware of small but important changes made in the file by some earlier link in the chain of distribution.

The Revision Description starts with the <revisionDesc> tag and ends with the </revisionDesc> tag. It contains one or more <change> elements:

<change>
summarizes a particular change or correction made to a particular version of an electronic text which is shared between several researchers. It is recommended to give changes in reverse chronological order, most recent first.


Each <change> element has the following child elements:

<date>
contains a date in any format and indicates the date of the change.
<respStmt>
indicates who made the change and in what role. Consists of the following constituents:
  • <name>: contains the name of the person who made the change.
  • <resp>: defines the role of the person who made the change.
<item>
contains one component of a list and gives a prose description of the type of change made to the file.


Example:

   <revisionDesc>
      <change>
         <date>2000-01-11</date>
         <respStmt>
            <resp>markup</resp>
            <name>Edward Vanhoutte (EV)</name>
         </respStmt>
         <item>added chapters 3.1-3.4</item>
      </change>
      <change>
         <date>2000-01-10</date>
         <respStmt>
            <resp>markup</resp>
            <name>Edward Vanhoutte (EV)</name>
         </respStmt>
         <item>file created</item>
      </change>
   </revisionDesc>


4. Conclusion


This document was prepared for the ETCL-BEB joint meeting (Leiden, 24 January 2000) on the standardized use of TEI encoding.
XHTML author: Edward Vanhoutte
Last revision: /2001
