Applying THL Word Style
Contributor(s): THL Staff.
The conversion routine is a work in progress. Ideally, each style available in TextToXMLConverter.doc would be converted to the appropriate XML markup. We are presently compiling the document that lists the correspondences between the styles and the XML markup, while also implementing each correspondence in the conversion routine. At present, however, some of the more specific, less-used styles are not converted. The styles most essential for structuring the XML document and for basic formatting of the text are available. (This discussion focuses specifically on the conversion process. For a more general discussion of the use of styles in Word, see the Using Microsoft Word Styles manual.)
The initial step in the process is to paste the original document into a copy of TextToXMLConverter.doc and rename the copy appropriately; this saves both the style information and the Visual Basic macros with the new document. Then go through the document and apply the appropriate styles to the paragraphs and words as necessary. There are two kinds of styles available in Microsoft Word. Paragraph styles apply to a whole paragraph of text, while character styles apply only to a certain character or run of characters anywhere within a paragraph. A header style is a paragraph style. Italics are a form of character style. Both kinds of styles are used in the markup process.
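The distinction between paragraph and character styles can be pictured with a small data model. The following is only an illustrative sketch in Python, not the converter's actual Visual Basic code: the `Run` and `Paragraph` classes and the `render_runs` function are hypothetical names, and the `rend` values are taken from the summary of direct formatting later in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from xml.sax.saxutils import escape


@dataclass
class Run:
    """A run of characters; char_style applies to this run only (hypothetical model)."""
    text: str
    char_style: Optional[str] = None  # "italic", "bold", "underline", or None


@dataclass
class Paragraph:
    """A paragraph; para_style applies to the whole paragraph (hypothetical model)."""
    para_style: str
    runs: List[Run] = field(default_factory=list)


# Direct formatting mapped to the rend value described in this section.
REND = {"italic": "weak", "bold": "strong", "underline": "underline"}


def render_runs(par: Paragraph) -> str:
    """Wrap each formatted run in <hi rend="...">; leave plain runs as text."""
    parts = []
    for run in par.runs:
        text = escape(run.text)
        rend = REND.get(run.char_style)
        parts.append(f'<hi rend="{rend}">{text}</hi>' if rend else text)
    return "".join(parts)


para = Paragraph("Normal", [Run("The "), Run("Blue Annals", "italic"), Run(" is a history.")])
print(f"<p>{render_runs(para)}</p>")
```

The paragraph style decides which element wraps the whole paragraph, while each run's character style decides the inline markup inside it.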
The basic principle of the conversion is that the metadata table at the top contains all the metadata, while the structure of the text is represented by the nested headers. Heading 1 represents the major divisions of the text, while Heading 2 represents the sub-divisions of those divisions. Heading 3 represents the sub-divisions of the sub-divisions, and so forth. Other than that, tables and lists should be formatted and styled appropriately, and if desired, one can apply specific styles to indicate personal names, place names, titles, and so forth. There is also a function within the conversion routine that goes through all italics in the document and allows the user to apply more specific styles in order to differentiate between titles, foreign languages, names, and other uses of italics. The description of the style/XML correspondences is found at:
The style-to-markup principles are summarized as follows:
- Headers are used to insert <div>s that divide the document into sections and subsections. They are used hierarchically: “Header1” marks the major divisions, “Header2” marks subdivisions of “Header1”, and so forth.
- Paragraphs in Word are marked with a <p> element unless they fall under one of the following cases.
- Bulleted lists with style “List Bullet,lb” are marked up with <list rend=“bullet”>.
- Numbered lists with style “List Number,ln” are marked up with <list rend=“1”>. Note: The conversion process does not distinguish between different kinds of numbered lists (those beginning with 1., those beginning with A., etc.). It marks all of them with rend=“1”. To get a lettered list, one must change the XML markup by hand to <list rend=“A”>, and so forth.
- Each item in a list is marked with <item>.
- Tables are marked up with <list rend=“table”>. The rows are the <item>s of the list and within each of those the columns are differentiated with <rs></rs> elements.
Note: One does not need to apply a particular style to the tables. Just insert the table through the table menu and the converter will recognize it as such. However, one must remember to include the metadata table at the beginning as the converter takes the first table in the document to be the metadata table.
- Paragraphs with style “Citation Prose,cp” will be marked up with <q> (for quote) elements.
- Paragraphs with style “Citation Verse1,cv1” and “Citation Verse2,cv2” will be marked with <lg> (line group) and <l> (line) elements. Each use of “Citation Verse1,cv1” opens a new line group (<lg>) and places that line in its first <l>. Paragraphs in “Citation Verse2,cv2” are simply placed in an <l> tag.
- Footnotes are converted to <note> elements that are inserted at the place where the footnote reference is.
Note: These are the regular footnotes entered into a document either by pressing Ctrl+Alt+f or from the Insert menu, Reference, Footnote.
- Hyperlinks are converted to <xref> elements with their n attribute set to the URL of the hyperlink and the content of the <xref> being the text of the link.
- Regular italic (Ctrl+i), bold (Ctrl+b) and underline (Ctrl+u) are marked up with <hi> tags whose rend attribute is given as “weak”, “strong”, or “underline” respectively. (This does not always work in footnotes.)
- The following character styles are converted to markup:
- Emphasis Weak, ew
- Emphasis Strong, es
- Date, dt
- Text Title, tt
- Lang Tibetan, tib
- Lang Sanskrit, san
- Lang Chinese, chi
- Lang Japanese, jap
- Name Personal Human, nph
- Name Personal other, npo
- Name Place, np
- Name organization, nor
- Reference, rf
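The block-level rules in the summary above (headers opening nested <div>s, list styles, prose citations, and verse line groups) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the converter's actual code: the `convert` function is hypothetical, the input is simplified to (style, text) pairs, the heading text is placed in a <head> element (the summary does not say where it goes), and a “Citation Verse2,cv2” line with no open line group is assumed to start one. Tables, footnotes, and character styles are left out.

```python
from xml.sax.saxutils import escape

LIST_REND = {"List Bullet,lb": "bullet", "List Number,ln": "1"}


def convert(paragraphs):
    """Turn (style, text) pairs into nested markup (illustrative sketch only)."""
    out = []
    depth = 0          # number of <div>s currently open
    open_block = None  # ("list", rend) or ("lg", None) while such a block is open

    def close_block():
        nonlocal open_block
        if open_block:
            out.append(f"</{open_block[0]}>")
            open_block = None

    for style, text in paragraphs:
        text = escape(text)
        if style.startswith("Heading "):
            level = int(style.split()[1])
            close_block()
            while depth >= level:      # close divs at the same or a deeper level
                out.append("</div>")
                depth -= 1
            while depth < level:       # open divs down to the requested level
                out.append("<div>")
                depth += 1
            out.append(f"<head>{text}</head>")  # <head> is an assumption
        elif style in LIST_REND:
            key = ("list", LIST_REND[style])
            if open_block != key:      # start a new list when the kind changes
                close_block()
                out.append(f'<list rend="{key[1]}">')
                open_block = key
            out.append(f"<item>{text}</item>")
        elif style == "Citation Verse1,cv1":
            close_block()              # every cv1 line opens a new line group
            out.append("<lg>")
            open_block = ("lg", None)
            out.append(f"<l>{text}</l>")
        elif style == "Citation Verse2,cv2":
            if open_block != ("lg", None):  # assumption: open a group if none is open
                close_block()
                out.append("<lg>")
                open_block = ("lg", None)
            out.append(f"<l>{text}</l>")
        elif style == "Citation Prose,cp":
            close_block()
            out.append(f"<q>{text}</q>")
        else:
            close_block()
            out.append(f"<p>{text}</p>")

    close_block()
    out.extend(["</div>"] * depth)
    return "".join(out)


doc = [
    ("Heading 1", "Introduction"),
    ("Normal", "Some text."),
    ("Heading 2", "Sources"),
    ("List Bullet,lb", "First"),
    ("List Bullet,lb", "Second"),
    ("Heading 1", "Conclusion"),
    ("Citation Verse1,cv1", "Line one"),
    ("Citation Verse2,cv2", "Line two"),
]
print(convert(doc))
```

In this sketch the second “Heading 1” closes both the “Sources” subsection and the “Introduction” section before opening a new top-level <div>, which is the nesting behavior the hierarchical use of headers implies.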
For the markup applied to each of the character styles, and for a quick way of sorting out indiscriminate use of italics, see the section on the Italics Conversion Macro (section 4.A) below. One should not spend too much time applying styles for conversion, since not all styles are functional yet and the converter does not always convert the functional ones properly. Furthermore, particularly long documents will often cause the converter to lock up and not finish the conversion. If this is a problem, break the document into smaller parts and combine them later as XML. The editor should instead focus on creating the structure of the document (headings, paragraphs, tables, notes, and lists) and applying basic character styles (titles, names, etc.). More complicated markup should be done in the XML editor.
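When a long document has been split and converted in parts, the parts can be recombined by hand in the XML editor or with a short script. The following Python sketch shows one possible approach using the standard library. It assumes each converted part is a well-formed XML string with a single root element whose children are the top-level <div>s; the `combine_parts` function and the “body” wrapper name are hypothetical, and any metadata each part carries would still need to be reconciled by hand.

```python
import xml.etree.ElementTree as ET


def combine_parts(part_strings, wrapper_tag="body"):
    """Collect the top-level children of each part under one wrapper element."""
    combined = ET.Element(wrapper_tag)
    for part in part_strings:
        root = ET.fromstring(part)   # raises ParseError if a part is not well-formed
        combined.extend(list(root))  # keep the parts in their original order
    return combined


part1 = "<body><div><p>First half.</p></div></body>"
part2 = "<body><div><p>Second half.</p></div></body>"
merged = combine_parts([part1, part2])
print(ET.tostring(merged, encoding="unicode"))
```

Parsing each part before merging also serves as a quick well-formedness check on the converter's output for that part.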