Gazetteer DTD Decoded For Dummies
Contributor(s): THDL Staff.
The THDL Gazetteer is the basis for THDL’s study and archiving of information on Tibetan and Himalayan places. In this sense it can be understood as a dictionary of place names, with limited additional information – types of places, location, relationships, and brief descriptions. It's an index of all “features” in the region, features being a generic term signifying all places of any type that have a geographical footprint, whether they be regions (nations, counties, cultural regions, and so forth) or points (specific lakes, villages, archaeological sites, and so forth). Thus it includes natural as well as cultural features.
The Gazetteer currently uses an XML database to store information, which is then transformed for display and searching over the Web. The database is driven by a custom-designed "DTD", or Document Type Definition, which is essentially a toolbox of elements/tags that the University of Virginia has designed for building geographical databases and marking up the entries in them. This document explains the DTD that governs the XML structure of the Tibetan and Himalayan Digital Library's "Gazetteer of Tibet and the Himalayas." The explanations are written for semi-technical readers who need to understand the workings of the DTD in order to develop or edit Gazetteer records. If your needs are more technically demanding, this is still a good first stop for understanding the purpose, design, and anatomy of the DTD. Editors who simply want to make Gazetteer entries without worrying about the XML should instead consult the Gazetteer User Guide.
The current, soon-to-be-replaced beta version of the gazetteer may be seen in THDL, and the same URL will give access to the full 1.0 release of the Gazetteer when it is ready to deploy.
- Introduction to DTD's and XML
- Example: How The Relationship Between the Gazetteer DTD and XML Works
- Overview of the Gazetteer DTD's Main Sections
- Guide to Selected Elements and their Attributes
- Guide to Entities
Each Gazetteer feature's entry information is maintained in a flat XML file (more below on XML), which is transformed for display, or searched, via the Web. The structure of the data, the rules governing what data may or must be entered about the Gazetteer and about each feature, and the criteria by which it must be entered are all determined by a custom-designed "DTD", or Document Type Definition. The DTD is essentially a map of the elements that the University of Virginia has designed for geographical databases and for marking up (preparing for use) the entries therein. It thus functions as the blueprint for the data, or the organizing structure imposed on the "storage facility". In this case the storage facility for the data is the XML.
In order to generate a web page showing information from the Gazetteer, the XML data is transformed using XSLT (the T is for transformation) via an XSL (eXtensible Stylesheet Language) stylesheet, which determines how that data will be displayed for end users. The stylesheet consists of a series of rules specifying how various elements within the XML should be displayed, and those rules are applied to create HTML pages for display on the THDL website.
The present document provides an easy-to-read explanation of this DTD for humanists who want to understand its technicalities and the function it serves. It refers frequently to the actual Gazetteer DTD, and relevant parts of the DTD are copied into this document as needed for illustration. As you start to understand how the actual DTD for the Gazetteer works and become familiar with its contents, this document will start to seem like the long way around, and you will find reading the DTD itself much quicker.
XML stands for eXtensible Markup Language. It is a flexible system that provides for customized storage and, later, display of content. An XML file is composed of a series of "tags" in <angle brackets>, some of which are single <self closing /> tags, some of which require matching <tag>opening and closing</tag> tags. Unlike HTML, which has a predefined set of tags to control structural display of information on the resulting Web page, XML allows (and also obliges) you to create your own tags for defining content. Control of the display of the content is done via "transformations" that push the content of the XML through XSL stylesheets (that's one way) to generate the resulting XHTML that is displayed on the Web page. Again, it is the DTD that constrains and regulates the infinite possibilities of the XML tags you could create.
"Elements" are the principal components of an XML structure. Some elements have "attributes" with "values" that provide extra information about the element. In all such cases, the attributes and values allowed to be used in any element are defined and constrained by the DTD. If you tried to use other attributes or values, the resulting XML text would not "validate" and the system would fail to publish information via the web site. Perhaps the clearest example is the "type" attribute of the <frel> element. It constrains the description of the type of relationship to part of, adjacent, near, intersects, other.
The basic syntax for one element in XML is:
<element attribute="value">content</element>
In the following example:
<fname lang="eng" resp="THDL"> <geogname>Tibet</geogname> </fname>
… we see that fname is an element; lang is an attribute of fname, with the value "eng". geogname is also an element within fname that contains content, i.e. Tibet. Note the proper "nesting" of the tags - first to open (fname) is last to close.
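To see the same element, attribute, and nesting structure from a program's point of view, here is a minimal sketch in Python using only the standard library. It parses the fname snippet quoted above and lists each element with its attributes and text. The helper name describe is made up for this illustration and is not part of any THDL tooling.

```python
import xml.etree.ElementTree as ET

# The fname snippet quoted above, unchanged.
SNIPPET = '<fname lang="eng" resp="THDL"> <geogname>Tibet</geogname> </fname>'


def describe(element, depth=0):
    """Return one line per element: its name, its attributes, and its own text."""
    text = (element.text or "").strip()
    indent = "  " * depth
    lines = [f"{indent}{element.tag} attributes={element.attrib} text={text!r}"]
    for child in element:
        # Children are indented one level deeper to show the nesting.
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(SNIPPET)
for line in describe(root):
    print(line)
```

The outer fname element carries the lang and resp attributes but no text of its own, while the nested geogname element carries the text Tibet and no attributes.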
A few last details about XML publishing. XML must meet two standards of coding. When XML code is said to be "Valid", it means that it was validated against a DTD or a Schema (a more advanced form of a DTD) and found to comply with the rules set therein. A failure of validity can result in errors serious enough that the XML/XSLT editing application oXygen used to refuse, by default, to save a non-valid XML file. An XML file must also be "Well Formed", meaning that it must follow general XML tag rules: any tag that is opened must be closed or self-closed, nesting rules must be observed, every attribute value must have matching quotes around it, and so on. Failure here will break the XML parser and result in a broken Web page.
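The difference between the two standards can be tried out with a short Python sketch. It checks well-formedness only: Python's standard-library parser does not validate against a DTD, so testing validity would need a validating parser or an editor such as oXygen. The function name is_well_formed and the two deliberately broken strings are invented for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if the text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return False, str(error)
    return True, None


good = '<fname lang="eng" resp="THDL"><geogname>Tibet</geogname></fname>'
# Tags closed in the wrong order: first to open must be last to close.
bad_nesting = '<fname lang="eng"><geogname>Tibet</fname></geogname>'
# The closing quote after eng is missing.
missing_quote = '<fname lang="eng><geogname>Tibet</geogname></fname>'

samples = [("good", good), ("bad nesting", bad_nesting), ("missing quote", missing_quote)]
for label, text in samples:
    ok, message = is_well_formed(text)
    print(label, "->", "well formed" if ok else f"not well formed ({message})")
```

Note that a file can pass this check and still be invalid, for example by using an attribute the DTD does not allow.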
Now we'll provide an actual example of XML from a Gazetteer record and show how it refers back to the DTD's structure. Once you understand these relationships and the coding that makes them work, most of the DTD will then be easy to understand.
We've already seen this bit of code from the XML file for the Tibet Autonomous Region itself (f1.xml):
<fname lang="eng" resp="THDL"> <geogname>Tibet</geogname> </fname>
But would this code Validate against the DTD? Is it written according to the rules therein? What if you wanted to add more kinds of data - is that allowed? How can you find out?
One way is to use a validator, but for the present purpose that would be cheating; instead we will go into the DTD and find the rules for the elements and attributes used above.
The first element we see is 'fname'.
The following symbols indicate whether an element is required, as well as the permissible number of occurrences of it:
- ? Zero or one of the item
- + One or more of the item
- * Zero or more of the item
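One way to get a feel for these symbols is to treat a DTD content model as a pattern and test lists of child elements against it. The Python sketch below does this for the fheader declaration that this document quotes from the DTD, ((fclass|ftype|fcode)*, authority?). It is a simplified teaching aid, not a DTD validator: it handles only element names, parentheses, commas, | and the three symbols, and the function names are made up for this illustration.

```python
import re


def content_model_to_regex(model):
    """Turn a simple DTD content model into a regex over space-terminated child names."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a token ending in a space, so that a short
    # name can never match the beginning of a longer one.
    tokens = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: f"(?:{re.escape(m.group(0))} )",
        compact,
    )
    # A comma means "followed by", which in a regex is plain concatenation.
    return re.compile(tokens.replace(",", ""))


def children_allowed(model, children):
    """Check a list of child element names, in order, against the content model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


FHEADER_MODEL = "((fclass|ftype|fcode)*, authority?)"

print(children_allowed(FHEADER_MODEL, ["fclass", "fcode", "authority"]))
print(children_allowed(FHEADER_MODEL, []))
print(children_allowed(FHEADER_MODEL, ["authority", "fclass"]))
print(children_allowed(FHEADER_MODEL, ["authority", "authority"]))
```

The first two calls are accepted: * allows any number of fclass, ftype, or fcode children (including none), and ? makes authority optional. The last two are rejected, because authority must come after the others and may appear at most once.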
This overview provides a brief rundown of the main sections of the DTD and what each is for.
Note first that the DTD contains both "actual" code and "comments": non-code messages, notes, or section titles written by the programmer to explain pieces of code, or simply to divide into sections what would otherwise be a rather daunting and unreadable (even to the programmer) file. Comments in the DTD start with <!-- and end with -->, like so:
<!-- comment text; section titles, explanations, etc etc -->
…whereas actual DTD code looks something like this, without the dashes used to make comments:
<!ELEMENT fheader ((fclass|ftype|fcode)*, authority?)>
Now, each of the main sections of the DTD has a title, in comments. Here are the commented titles in order of occurrence in the file, and a brief explanation of each. There are two general kinds of these - Entities and Elements.
First, four "entity" groups are defined. "Entities" are variables that act as shortcuts: a frequently used bit of text is defined once, in a single place, and then referred to by name wherever it is needed. For example, if you saw something like "%fishtypes;" in many element definitions, it would mean that in the entities section you would find something like <!ENTITY % fishtypes "trout|bass|salmon">. Entities are handy because if you want to add, say, "perch", you only have to add it to the entity, instead of having to change that list of fish types in every element where it occurs.
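The shortcut behaviour can be imitated in a few lines of Python. The sketch below collects parameter entity declarations of the form shown in the fish example and substitutes each reference, written %fishtypes; with a trailing semicolon as DTD syntax requires. It is a rough text substitution, not how a real XML parser processes a DTD, and the pond and catch declarations are invented purely for the demonstration.

```python
import re

ENTITY_DECL = re.compile(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>')
ENTITY_REF = re.compile(r"%([\w.-]+);")


def expand_parameter_entities(dtd_text):
    """Replace every %name; reference with the text declared for that entity."""
    entities = dict(ENTITY_DECL.findall(dtd_text))
    body = ENTITY_DECL.sub("", dtd_text)
    # Repeat a few times so entities that mention other entities also resolve;
    # the fixed limit guards against an entity that refers to itself.
    for _ in range(10):
        expanded = ENTITY_REF.sub(lambda m: entities.get(m.group(1), m.group(0)), body)
        if expanded == body:
            break
        body = expanded
    return body.strip()


# Hypothetical declarations built around the fish example in the text.
DTD_TEXT = """
<!ENTITY % fishtypes "trout|bass|salmon">
<!ELEMENT pond (%fishtypes;)*>
<!ATTLIST catch kind (%fishtypes;) #REQUIRED>
"""

print(expand_parameter_entities(DTD_TEXT))

# Adding "perch" in one place changes every declaration that uses the entity.
with_perch = DTD_TEXT.replace("trout|bass|salmon", "trout|bass|salmon|perch")
print(expand_parameter_entities(with_perch))
```

In the second printout, perch appears in both the pond and the catch declarations even though only the entity definition was edited.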
<!-- Data-type Entities --> - These define entities for Date, URI and Language codes.
<!-- Attribute Class Entities --> - These define entities for commonly used attributes.
<!-- Attribute Value Entities --> - These define entities for commonly used attribute values.
<!-- Element Model Entities --> - Eclectic group of entities for editorial markup, text formatting, hypertext linking and geometry classes for geographical information.
<!-- Structural elements --> - These could be said to be the primary elements of physical Gazetteer records, along with their attributes. Very important.
<!-- Bibliographic elements --> - These are what you would expect; not much usage to date in Gazetteer records.
<!-- Date and time elements --> - These regulate dates associated with an event in the life cycle of the resource. Typically, date will be associated with the creation or availability of the resource.
<!-- Geometry elements --> - Used to define spatial elements for GIS-related data.
<!-- Additional text elements --> - A bit of a grab bag of elements and related attributes regulating XML from footnote handling to HTML rules to postal addresses and further bibliographic types. Remember to look here if you haven't found your element in more obvious places.
<!-- Low-level, e.g. inline, elements --> - A handful of eclectic items here.
<!-- Inline commentary elements --> - Place related text regulations.
<!-- Hypertext linking elements --> - These elements constrain coding for linking to internal and external resources.
Although they come after the Entities in the DTD, we're listing the Elements first here because they are of more general concern to THDL developers. As mentioned above, nearly all of the data in the XML relates to elements and attributes in the Structural Elements section of the DTD. Those elements and attributes are explicated as follows, along with the Hypertext Linking Elements, which are also currently in use and important to understand.
There are many other elements that are not covered yet in this documentation; you'll have to ask for help if you cannot parse their entries from the DTD itself.
Please visit the respective links for the actual DTD entry for each and further explanations of their use and attributes:
- feature: This is the top level element in the description of each feature in the Gazetteer - it is that which marks an entry in the Gazetteer.
- fheader: Administrative information about the feature, including the identification of the type of feature it is. The feature type will generally be drawn from a standardized thesaurus of feature types, such as the feature thesaurus for the Alexandria Digital Library.
- fclass: Feature classification. The <fclass> element contains feature classification data, e.g. populated place, river, etc.; it must include the type of feature thesaurus and a term from that thesaurus, e.g., <fclass term="autonomous region" type="administrative"/>.
- fcode: The <fcode> element contains a feature code, which may or may not be based on feature location, place in a hierarchy, etc.
- fname: This is the name of the feature. A feature in the Gazetteer will generally have multiple names. They may be entirely distinct like 'Kailash' vs. 'gangs rin po che', or they may be a variant of the same name such as a transliteration from Tibetan to an easy-to-pronounce English spelling.
- translation: When used, indicates the linguistic relationship of the particular fname to the primary fname of the feature.
- transliteration: When used, indicates the linguistic relationship of the particular fname to the primary fname of the feature.
- mixed: When used, indicates the linguistic relationship of the particular fname to the primary fname of the feature.
- geogname: Geogname is a container for the actual name of the feature.
- etymology: Explanatory text on the origin and development of a geographic name. Currently, etymology is not used but we are leaving it in place.
- place: This is the spatial location of the feature. It could be a point with just latitude and longitude coordinates or the geometric data used to describe it in a GIS application, etc. Changes in location or dimensions of a feature over time are handled with the time sub-element on place.
- time: The time during which a geogname was in use or, in a place context, the time during which the feature existed at the given location.
- fdesc: This is a brief textual description of the feature.
- frel: This is the element in which the feature's relationship to other features is described.
- frelgrp: A group of related features. This can be used to group together certain features based on whatever criteria the maker of the gazetteer chooses. For example, one may choose to group together buildings by a type such as 'earthen structure'.
- linkgrp: Group of pointers or refs. Use linkgrp to provide multiple, alternative or simultaneously available, pointer or reference targets.
- ptr: An internal linking element which provides for movement from one place in the gazetteer to another.
- extptr: An empty linking element which connects the gazetteer to an external (most often, binary) electronic object, such as an image.
- ref: An internal linking element which provides for movement within a gazetteer. Unlike the ptr element, the ref element may contain text or other elements that identify or describe the referenced object.
- extref: A linking element which connects the gazetteer to an external (most often text) electronic object, such as another XML or HTML document.
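To show how a few of these elements fit together, the Python sketch below assembles a tiny record from pieces that appear in this document: the fclass example, the fname snippet, and the rule that fclass sits inside fheader. Placing fheader and fname directly under feature, and in that order, is an assumption made for illustration. The real DTD may require a different arrangement or additional attributes, so the output should not be taken as a valid Gazetteer record.

```python
import xml.etree.ElementTree as ET

# Top-level element for one Gazetteer entry.
feature = ET.Element("feature")

# fheader holds the classification; the fclass attributes are the ones
# given in the fclass description above.
fheader = ET.SubElement(feature, "fheader")
ET.SubElement(fheader, "fclass", {"term": "autonomous region", "type": "administrative"})

# fname wraps geogname, as in the snippet from f1.xml.
# Assumption: fname is a direct child of feature.
fname = ET.SubElement(feature, "fname", {"lang": "eng", "resp": "THDL"})
ET.SubElement(fname, "geogname").text = "Tibet"

print(ET.tostring(feature, encoding="unicode"))
```

Building the record through a library rather than by typing tags guarantees that the result is well formed, although it still says nothing about whether the record is valid against the DTD.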
These are the "entities" or 'shortcuts' for commonly used bits of text or code.
Visit the respective links for the actual DTD entry for each and further explanations of their use:
- %ISODATE : Date in ISO format, e.g. YYYYMMDD.
- %URI : A URN or URL.
- %LANGUAGE : Language code.
- %a.bibl : Bibliographic data.
- %a.common : id, label and type.
- %agent.form : Agent types.
- %format.form : Dublin core format types.
- %frel.role : Feature relation types.
- %link.rel : Relationships among sibling linking elements.
- %loc.type : Location (URI) types. Necessary for systems where both URLs and URNs are in use.
- %emph.rend : Visual renderings.
- %title.form : Dublin Core title types.
- %certaintyvalues : Levels of certainty.
- %priorityvalues : Levels of priority.
- %m.comment : These elements allow editorial markup of text.
- %m.geometryclasses : Feature geometry.
- %m.inline : Basic text formatting elements (lb and emph), hypertext anchors, and index terms.
- %m.linkelements : Hypertext linking elements.
Provided for unrestricted use by the Tibetan and Himalayan Digital Library