TEI class att.datcat

att.datcat

att.datcat provides attributes that are used to align XML elements or attributes with the appropriate Data Categories (DCs) defined by an external taxonomy, in this way establishing the identity of information containers and values, and providing means of interpreting them. [10.5.2 Lexical View; 19.3 Other Atomic Feature Values]

Module tei — The TEI Infrastructure

Members att.lexicographic [case colloc def entryFree etym form gen gram gramGrp hom hyph iType lang lbl mood number oRef orth pRef per pos pron re sense subc syll tns usg xr] att.segLike [c cl m pc phr s seg w] binary category f fDecl fs fsDecl numeric string symbol tagUsage taxonomy

Attributes

datcat

provides a pointer to a definition of, and/or general information about, (a) an information container (element or attribute) or (b) a value of an information container (element content or attribute value), by referencing an external taxonomy or ontology. If valueDatcat is present in the immediate context, this attribute takes on role (a), while valueDatcat performs role (b).

Status	Optional
Datatype	1–∞ occurrences of teidata.pointer separated by whitespace

valueDatcat

provides a definition of, and/or general information about, a value of an information container (element content or attribute value), by reference to an external taxonomy or ontology. Used especially where a contrast with datcat is needed.

Status	Optional
Datatype	1–∞ occurrences of teidata.pointer separated by whitespace

targetDatcat

provides a definition of, and/or general information about, the information structure of an object referenced or modeled by the containing element, by reference to an external taxonomy or ontology. This attribute has the characteristics of the datcat attribute, except that it addresses not its containing element, but an object that is being referenced or modeled by its containing element.

Status	Optional
Datatype	1–∞ occurrences of teidata.pointer separated by whitespace

Example

The example below presents the TEI encoding of the name-value pair <part of speech, common noun>, where the name (key) ‘part of speech’ is abbreviated as ‘POS’, and the value, ‘common noun’ is symbolized by ‘NN’. The entire name-value pair is encoded by means of the element f. In TEI XML, that element acts as the container, labeled with the name attribute. Its contents may be complex or simple. In the case at hand, the content is the symbol ‘NN’.

The datcat attribute relates the feature name (i.e., the key) to the data category ‘part of speech’, while the attribute valueDatcat relates the feature value to the data category common noun. Both these data categories should be defined in an external and preferably open reference taxonomy or ontology.
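
The encoding just described can be sketched as follows. This is a reconstruction assembled from the identifiers quoted elsewhere on this page, not a verified copy of the original example; the enclosing fs element is assumed for context.

```xml
<fs>
  <!-- feature name (the container), aligned with the data category "part of speech" -->
  <f name="POS"
     datcat="http://hdl.handle.net/11459/CCR_C-396_5a972b93-2294-ab5c-a541-7c344c5f26c3">
    <!-- feature value, aligned with the data category "common noun" -->
    <symbol value="NN"
            valueDatcat="http://hdl.handle.net/11459/CCR_C-1256_7ec6083c-23d4-224d-6f94-eecbe6861545"/>
  </f>
</fs>
```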

‘NN’ is the symbol for common noun used e.g. in the CLAWS-7 tagset defined by the University Centre for Computer Corpus Research on Language at the University of Lancaster. The very same data category, as used for tagging an early version of the British National Corpus with the BNC Basic (C5) tagset, is represented by the symbol ‘NN0’ (rather than ‘NN’). Making these values semantically interoperable would be extremely difficult without a human expert if they were not anchored in a single point of an established reference taxonomy of morphosyntactic data categories. In the case at hand, the string ‘http://hdl.handle.net/11459/CCR_C-1256_7ec6083c-23d4-224d-6f94-eecbe6861545’ is both a persistent identifier of the data category in question and a pointer to a shared definition of common noun.

While the symbols ‘NN’, ‘NN0’, and many others (often coming from languages other than English) are implicitly members of the container category ‘part of speech’, it is sometimes useful not to rely on such an implicit relationship but rather use an explicit identifier for that data category, to distinguish it from other morphosyntactic data categories, such as gender, tense, etc. For that purpose, the above example uses the datcat attribute to reference a definition of part of speech. The reference taxonomy in this example is the CLARIN Concept Registry.

If the feature structure markup exemplified above is to be repeated many times in a single document, it is much more efficient to gather the persistent identifiers in a single place and to only reference them, implicitly or directly, from feature structure markup. The following example is much more concise than the one above and relies on the concepts of feature structure declaration and feature value library, discussed in chapter 19 Feature Structures.
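
A minimal sketch of such a concise encoding, assuming that the value is stored elsewhere under the hypothetical identifier commonNoun and is referenced by means of the fVal attribute:

```xml
<fs>
  <!-- "commonNoun" is an assumed xml:id of a symbol kept in a feature value library -->
  <f name="POS" fVal="#commonNoun"/>
</fs>
```

Here the persistent identifier no longer appears in the annotation itself; it is expected to be found on the element that the pointer resolves to.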

The assumption here is that the relevant feature values are collected in a place that the annotation document in question has access to — preferably, a single document per linguistic resource, for example an fsdDecl that is XIncluded as a sibling of text or a child of encodingDesc; a taxonomy available resource-wide (e.g., in a shared header) is also an option.

The example below presents an fvLib element that collects the relevant feature values (most of them omitted). At the same time, this example shows one way of encoding a tagset, i.e., an established inventory of values of (in the case at hand) morphosyntactic categories.
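
A feature value library of this kind might look roughly like the sketch below. The xml:id values and the n label are illustrative assumptions; the two persistent identifiers are the ones quoted for ‘NN’ and ‘NP’ elsewhere on this page.

```xml
<fvLib n="POS values">
  <!-- xml:id values are hypothetical; they are the targets of fVal pointers -->
  <symbol xml:id="commonNoun" value="NN"
          datcat="http://hdl.handle.net/11459/CCR_C-1256_7ec6083c-23d4-224d-6f94-eecbe6861545"/>
  <symbol xml:id="properNoun" value="NP"
          datcat="http://hdl.handle.net/11459/CCR_C-1371_fbebd9ec-a7f4-9a36-d6e9-88ee16b944ae"/>
  <!-- remaining values of the tagset omitted -->
</fvLib>
```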

Note that these Guidelines do not prescribe a specific choice between datcat and valueDatcat in such cases. The former is the generic way of referencing a data category, whereas the latter is more specific, in that it references a data category that represents a value. The choice between them comes into play where a single element (or a tight element complex, such as the f/symbol complex illustrated above) makes it necessary or useful to distinguish between the container data category and its value.

Example

In the context of dictionaries designed with semantic interoperability in mind, the following example ensures that the pos element is interpreted as the same information container as in the case of the example of <f name="POS"> above.
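
As a sketch of what such dictionary markup could look like (reconstructed from the identifiers used for part of speech and common noun elsewhere on this page, with the content ‘NN’ chosen for illustration):

```xml
<gramGrp>
  <!-- datcat identifies the container (part of speech); valueDatcat identifies its value (common noun) -->
  <pos datcat="http://hdl.handle.net/11459/CCR_C-396_5a972b93-2294-ab5c-a541-7c344c5f26c3"
       valueDatcat="http://hdl.handle.net/11459/CCR_C-1256_7ec6083c-23d4-224d-6f94-eecbe6861545">NN</pos>
</gramGrp>
```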

For this type of interoperable markup to be efficient, the references to the particular data categories are best provided in a single place within the dictionary (or a single place within the project), rather than being repeated inside every entry. For the container elements, this can be achieved at the level of tagUsage, although here the targetDatcat attribute should be used, because it is not the tagUsage element that is associated with the relevant data category, but rather the element pos (or case, etc.) that is described by tagUsage:

<tagsDecl partial="true">

<namespace name="http://www.tei-c.org/ns/1.0">
  <tagUsage gi="pos"
   targetDatcat="http://hdl.handle.net/11459/CCR_C-396_5a972b93-2294-ab5c-a541-7c344c5f26c3">Contains the part of speech.</tagUsage>
  <tagUsage gi="case"
   targetDatcat="http://hdl.handle.net/11459/CCR_C-1840_9f4e319c-f233-6c90-9117-7270e215f039">Contains information about the grammatical case that the described form is inflected for.</tagUsage>

</namespace>
</tagsDecl>

Another possibility is to shorten the URIs by means of the prefixDef mechanism, as illustrated below:

<listPrefixDef>
<prefixDef ident="ccr" matchPattern="pos"
  replacementPattern="http://hdl.handle.net/11459/CCR_C-396_5a972b93-2294-ab5c-a541-7c344c5f26c3"/>
<prefixDef ident="ccr" matchPattern="adj"
  replacementPattern="http://hdl.handle.net/11459/CCR_C-1230_23653c21-fca1-edf8-fd7c-3df2d6499157"/>
</listPrefixDef>

<entry>

<form>
  <orth>isotope</orth>
</form>
<gramGrp>
  <pos datcat="ccr:pos"
   valueDatcat="ccr:adj">adj</pos>
</gramGrp>

</entry>

This mechanism creates implications that are not always wanted. In the case at hand, among others, it suggests that the identifiers ‘pos’ and ‘adj’ belong to a namespace associated with the CLARIN Concept Registry (CCR), whereas the prefix is solely a shorthand mechanism whose scope is the current resource. Documenting this clearly in the header of the dictionary is therefore advised.

Yet another possibility is to record the relationship between a TEI markup element and the data category that it is intended to model when the dictionary resource itself is modeled, that is, at the level of the ODD, in the equiv element that is a child of elementSpec or attDef.
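
A minimal sketch of the ODD-level approach, assuming a customization that modifies the existing pos element (the module and mode values are assumptions about the surrounding ODD, not taken from a specific project):

```xml
<elementSpec ident="pos" module="dictionaries" mode="change">
  <!-- equiv records the external concept that the pos element is intended to represent -->
  <equiv uri="http://hdl.handle.net/11459/CCR_C-396_5a972b93-2294-ab5c-a541-7c344c5f26c3"/>
</elementSpec>
```

With this approach the alignment is stated once, in the schema specification, and instance documents need not carry a datcat attribute on every pos element.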

Example

The taxonomy element is a handy tool for encoding taxonomies that are later referenced by att.datcat attributes, but it can also act as an intermediary device, for example holding a fragment of an external taxonomy (or ‘flattening’ an external ontology) that is relevant to the project or document at hand. (It is also imaginable that, for the purpose of the project at hand, the local taxonomy element combines vocabularies that originate from more than one external taxonomy or ontology.) In such cases, the taxonomy creates a local layer of indirection: the att.datcat attributes internal to the resource may reference the category elements stored in the header (as well as the taxonomy element itself), whereas these same category and taxonomy elements use att.datcat attributes to reference the original taxonomy or ontology.

<encodingDesc>

<classDecl>

  <taxonomy xml:id="UD-SYN"
   datcat="https://universaldependencies.org/u/dep/index.html">
   <desc>
    <term>UD syntactic relations</term>
   </desc>
   <category xml:id="acl"
    valueDatcat="https://universaldependencies.org/u/dep/acl.html">
    <catDesc>
     <term>acl</term>: Clausal modifier of noun (adjectival clause)</catDesc>
   </category>
   <category xml:id="acl_relcl"
    valueDatcat="https://universaldependencies.org/u/dep/acl-relcl.html">
    <catDesc>
     <term>acl:relcl</term>: relative clause modifier</catDesc>
   </category>
   <category xml:id="advcl"
    valueDatcat="https://universaldependencies.org/u/dep/advcl.html">
    <catDesc>
     <term>advcl</term>: Adverbial clause modifier</catDesc>
   </category>

  </taxonomy>
</classDecl>
</encodingDesc>

The above fragment was excerpted from the GB subset of the ParlaMint project in April 2023, and enriched with att.datcat attributes for the purpose of illustrating the mechanism described here.

Note that, in the ideal case, the values of att.datcat attributes should be persistent identifiers, and that the addressing scheme of Universal Dependencies is treated here as persistent for the sake of illustration. Note also that the contrast between datcat used on taxonomy on the one hand, and the valueDatcat used on category on the other, is not mandatory: both kinds of relations could be encoded by means of the generic datcat attribute, but using the former for the container and the latter for the content is more user-friendly.
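
For comparison, a sketch of the same taxonomy fragment with the generic datcat attribute used on both levels, abbreviated to a single category:

```xml
<taxonomy xml:id="UD-SYN"
          datcat="https://universaldependencies.org/u/dep/index.html">
  <!-- the generic datcat replaces valueDatcat; the referenced page is unchanged -->
  <category xml:id="acl"
            datcat="https://universaldependencies.org/u/dep/acl.html">
    <catDesc>
      <term>acl</term>: Clausal modifier of noun (adjectival clause)</catDesc>
  </category>
</taxonomy>
```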

Example

The targetDatcat attribute is designed to be used in, e.g., feature structure declarations, and is analogous to the targetLang attribute of the att.pointing class, in that it describes the object that is being referenced, rather than the referencing object.

<fDecl name="POS"
targetDatcat="http://hdl.handle.net/11459/CCR_C-396_5a972b93-2294-ab5c-a541-7c344c5f26c3">
<fDescr>part of speech (morphosyntactic category)</fDescr>
<vRange>
  <vAlt>
   <symbol value="NN"
    datcat="http://hdl.handle.net/11459/CCR_C-1256_7ec6083c-23d4-224d-6f94-eecbe6861545"/>
   <symbol value="NP"
    datcat="http://hdl.handle.net/11459/CCR_C-1371_fbebd9ec-a7f4-9a36-d6e9-88ee16b944ae"/>

  </vAlt>
</vRange>
</fDecl>

Above, the fDecl uses targetDatcat, because if it were to use datcat, it would be asserting that it is an instance of the container data category part of speech, whereas it is not — it models a container (f) that encodes a part of speech. Note also that it is the f that is modeled above, not its values, which are used as direct references to data categories; hence the use of datcat in the symbol element.

Example

The att.datcat attributes can be used for any sort of taxonomies. The example below illustrates their usefulness for describing usage domain labels in dictionaries on the example of the Diccionario da Lingua Portugueza by António de Morais Silva, retro-digitised in the MORDigital project.

<encodingDesc>
<classDecl>
  <taxonomy xml:id="domains">

   <category xml:id="domain.medical_and_health_sciences">
    <catDesc xml:lang="en">Medical and Health Sciences</catDesc>
    <catDesc xml:lang="pt">Ciências Médicas e da Saúde</catDesc>
    <category xml:id="domain.medical_and_health_sciences.medicine"
     valueDatcat="https://vocabs.rossio.fcsh.unl.pt/pub/morais_domains/pt/page/0025">
     <catDesc xml:lang="en">
      <term>Medicine</term>
      <gloss>

      </gloss>
     </catDesc>
     <catDesc xml:lang="pt">
      <term>Medicina</term>
      <gloss>

      </gloss>
     </catDesc>
    </category>
   </category>

  </taxonomy>
</classDecl>
</encodingDesc>

<usg type="domain"
valueDatcat="#domain.medical_and_health_sciences.medicine">Med.</usg>

In the Morais dictionary, the relevant domain labels are stored in the header and referenced from usg elements inside the dictionary. The vocabulary used for dictionary-internal labelling is in turn anchored in the MORDigital controlled vocabulary service of the NOVA University of Lisbon – School of Social Sciences and Humanities (NOVA FCSH).

Note

The TEI Abstract Model can be expressed as a hierarchy of attribute-value matrices (AVMs) of various types and of various levels of complexity, nested or grouped in various ways. At the most abstract level, an AVM consists of an information container and the value (contents) of that container.

A simple example of an XML serialization of such structures is, on the one hand, the opening and closing tags that delimit and name the container, and, on the other, the content enclosed by the two tags that constitutes the value. An analogous example is an attribute name and the value of that attribute.

In a TEI XML example of two equivalent serializations expressing the name-value pair <part-of-speech,common-noun>, namely <pos>common-noun</pos> and pos="common-noun", one would classify the element pos and the attribute pos as containers (mapping onto the first member of the relevant name-value pair), while the character data content of pos or the value of pos would be seen as mapping onto the second member of the pair.

The att.datcat class provides means of addressing the containers and their values, while at the same time providing a way to interpret them in the context of external taxonomies or ontologies. Aligning e.g. both the pos element and the pos attribute with the same value of an external reference point (i.e., an entry in an agreed taxonomy) affirms the identity of the concept serialised by both the element container and the attribute container, and optionally provides a definition of that concept (in the case at hand, the concept part of speech).

The value of the att.datcat attributes should be a PID (persistent identifier) that points to a specific — and, ideally, shared — taxonomy or ontology. Among the resources that can, to a lesser or greater extent, be used as inventories of (more or less) standardized linguistic categories are the GOLD ontology, CLARIN CCR, OLiA, or TermWeb's DatCatInfo, and also the Universal Dependencies inventory, on the assumption that its URIs are going to persist. It is imaginable that a project may choose to address a local taxonomy store instead, but this risks losing the advantage of interchangeability with other projects.

Historically, datcat and valueDatcat originate from the (now obsolete) ISO 12620:2009 standard, describing the data model and procedures for a Data Category Registry (DCR). The current version of that standard, ISO 12620-1, does not standardize the serialization of pointers, merely mentioning the TEI att.datcat as an example.

Note that no constraint prevents the occurrence of a combination of att.datcat attributes: the fDecl element, which is a natural bearer of the targetDatcat attribute, is an instance of a specific modeling element, and, in principle, could be semantically fixed by an appropriate reference taxonomy of modeling devices.
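
A sketch of such a combination is given below. The datcat value is a purely hypothetical URI standing in for an entry in a reference taxonomy of modeling devices; no such taxonomy is named in this specification. The targetDatcat value is the identifier for part of speech used in the examples above.

```xml
<!-- datcat: what the fDecl itself is (hypothetical URI, for illustration only) -->
<!-- targetDatcat: what the modeled container f encodes (part of speech) -->
<fDecl name="POS"
       datcat="https://example.org/modeling-devices/feature-declaration"
       targetDatcat="http://hdl.handle.net/11459/CCR_C-396_5a972b93-2294-ab5c-a541-7c344c5f26c3">
  <fDescr>part of speech (morphosyntactic category)</fDescr>
  <vRange>
    <symbol value="NN"
            datcat="http://hdl.handle.net/11459/CCR_C-1256_7ec6083c-23d4-224d-6f94-eecbe6861545"/>
  </vRange>
</fDecl>
```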

TEI: Guidelines for Electronic Text Encoding and Interchange
