Creating an XML Essay for THL
Contributor(s): David Germano, Nathaniel Grove, Steven Weinberger.
This document describes the process for creating an XML essay for display in the Tibetan & Himalayan Library. Such an essay can be either a scholarly essay on a subject in the field or a technical essay such as the present document. In either case, the basic process is the same and, at least for the time being, the styles used for displaying the two are the same. In the latest version of the converter, the process has been significantly simplified.
The process of creating an XML document for THL has been simplified by using Microsoft Word and glossary tables. One first extracts all glossary items (personal names, place names, text titles, terms, etc.) from the essay and places them with their associated information (phonetics, translations into other languages, dates, item type, etc.) in the glossary table. The converter uses the glossary table to search the essay and apply the appropriate Word style to each occurrence of each item. It then converts the Word document to XML based on the Word styles applied. Finally, it applies the glossary information to the essay, creating correspondences between each occurrence of an item and its glossary entry and adding supplementary information (Wylie for terms, dates for people, etc.) to the first occurrence of that item. The result is two XML files: one for the essay and one for the glossary.
The converter will also convert an essay without a glossary by simply converting Word styles applied in the essay itself into XML. It will also convert a Tibetan text (as long as it is entered in Unicode) into XML with or without a glossary.
The whole process consists of the following steps:
- Set up an XML editor and download the THL Word styles. This step is described at Configuring Your Computer.
- Download fonts and template files. See MS Word Template Files and Unicode Font.
- Copy the essay into a document that has THL Word styles (unless it already has the THL Styles associated with it) and apply basic paragraph, or structural, styles. See Applying the First Word Styles and the page on Essay Structure. One should also apply paragraph level styles for things such as verses, citations, lists, and the like. For full information on THL's XML mark-up scheme, see our XML markup manual.
- Create the glossary as described on the page on the glossary table. Alternatively, apply the Word styles to a document without a glossary, as described at Using Word styles to create an XML document.
- Install the converter, described below.
- Convert the Word document to XML, described below. Address any problems with the conversion as described below.
- Edit the resulting XML.
For a description of how the converter works, see the section below.
The conversion routine is written in Visual Basic for Applications and is included within a Word document called “thl-word2xml-conv-v1.3.doc”. It will only work on Windows machines. It is necessary to obtain this file and a few other supplementary ones for the conversion to run successfully. To set up one’s machine properly for conversion, do the following (this is described for version 1.3; the version number may change if the converter is updated):
- Download the WordToXMLConverterLatest from the THL website.
- Extract the entire contents of the .zip file to any location on your computer. This is done either by opening the .zip file and pressing the Extract button or by right-clicking on the .zip file and choosing the "Extract to here …" option. It will create a folder called "w2xmlconv-v1.3".
- Delete the files called "Place-Holder-DELETE.txt" in the /indocs/ and the /outdocs/ folders.
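The placeholder cleanup in the last step can also be scripted. The following is a minimal sketch, not part of the converter: the function name is hypothetical, and it assumes the extracted folder contains indocs and outdocs subfolders, each holding a file named exactly "Place-Holder-DELETE.txt".

```python
from pathlib import Path

# File name as given in the setup instructions.
PLACEHOLDER = "Place-Holder-DELETE.txt"


def remove_placeholders(converter_root):
    """Delete the placeholder file from indocs/ and outdocs/.

    converter_root is the folder created by extracting the .zip file.
    Returns the list of paths that were actually removed.
    """
    removed = []
    for sub in ("indocs", "outdocs"):
        target = Path(converter_root) / sub / PLACEHOLDER
        if target.is_file():  # skip quietly if already deleted
            target.unlink()
            removed.append(target)
    return removed
```

Calling remove_placeholders with the path of the extracted "w2xmlconv-v1.3" folder removes both files. Running it a second time is harmless and returns an empty list.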
At this point, you should be ready to convert a document created with THL Word styles, as follows. The following instructions are for the new converter (v1.3 and above). For previous versions of the converter, see Old Instructions on Converting Word To XML.
- Note: if you are using the latest Java 1.7 (aka Java 7), then you CANNOT be connected to UVaAnywhere or any Cisco Anywhere VPN as this will cause a socket error. See http://www.java.net/node/703177
- Place the Word doc and glossary table, if there is one, in the /w2xmlconv/indocs/ folder.
- Open the file thl-word2xml-conv-v1.3.docm (it's in the folder w2xmlconv-v1.3)
- Make sure no other Word documents except this one are open.
- Press alt+c to begin the conversion process.
- If a dialog box opens saying that macros are disabled, you need to enable macros
- For versions of Word before 2007: go to Tools > Macro > Security and then select Medium
- For Word 2007 and later: click on the Office icon in the top left corner of the document
- Click on Word Options at the bottom of the page that opens
- Click on Trust Center in the lefthand column
- Click the Trust Center Settings button
- Click on Macro Settings in the lefthand column
- Click the button to the left of Enable all macros
- Save the file, close it, and reopen it
- A dialog box will appear asking you to choose the document to convert. Choose the document you want to convert. A second custom dialog box will then open.
- Fill out the second dialog box in the following way (Image of Converter Dialog Box):
- Your ID: enter either your THL numeric ID, if you have one, or your three initials.
- JIATS Essay? Check this box if it is a JIATS essay
- JIATS Issue Number: This will only appear if the box is checked. Enter the issue number of the article being converted.
- Author Name: Enter the author’s last name in lower case, if applicable
- Essay Title: Enter the essay title in lower case. This is not used for JIATS articles.
- XML File Name: This display box automatically updates to show you the XML file name of the converted document as you modify the above information.
- Has Glossary? Check this box if the essay has a glossary.
- Press the Convert! button. (To cancel press escape while the dialog box is showing or click the close button at the top right of the dialog box.)
- The converter will then ask you to locate the necessary documents and will prompt you to confirm if you are writing over the resulting XML files. At any of these windows you can cancel the conversion process.
- If duplicate entries are found in the glossary table, a dialog will ask which you want to keep. The conversion can also be cancelled at this point.
- The rest of the conversion process is automatic and cannot be cancelled. You will see an MS-DOS window open as the information from the glossary is applied to the essay. It will tell you when the conversion is complete and say “Press any key to continue” (see the Image of DOS Window for a Successful Conversion) unless the initial XML document is not valid, in which case it will inform you that there has been an error.
- At the end, the resulting XML document(s) will be found in the /w2xmlconv/outdocs folder. In cases where a glossary was used, two versions of both the essay and the glossary file will be found. The glossary will have the same name as the essay file with “-gloss.xml” appended to it (see the Image of DOS Window for an Unsuccessful Conversion).
If the resulting XML document is not valid, the transformation to apply the glossary entry information will not work. In such cases, an error statement will appear in the MS-DOS window that monitors the XSLT transformation. One will then have to find the XML document output by the converter. This will be located in the /outdocs/ folder within the converter under the name originally specified. The last part of the conversion process, which applies the glossary information to the essay using XSLT, has to be redone. The whole process is:
- Open up the output XML document in Oxygen and fix any validation errors that are found. Save and close it.
- From this converter document, press alt+x. A message page will appear saying you’ve initiated the application of XSLT transformations to apply glossary information to an essay.
- Press continue.
- A file dialog window will appear. Choose the corrected XML essay file.
- Another file dialog will appear. Choose the XML glossary table file.
- The rest of the process will proceed on its own in an MS-DOS window as above, informing you either that the “Conversion is complete” or that “There has been an error.” If there is an error, re-check the original XML document to make sure there are no validation errors.
- The resulting XML documents will be found under the same names as the original documents in the /w2xmlconv/outdocs folder.
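Before reopening the converter document to re-run the XSLT step, it can save time to confirm that the corrected file at least parses. The sketch below is an illustration only, with a hypothetical function name. It uses Python's standard xml.etree.ElementTree module and checks well-formedness alone; it does not validate against THL's markup scheme, so it is no substitute for validating in Oxygen.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text parses as XML, else the parser's error message.

    This only catches problems such as unclosed or mismatched tags.
    It does NOT check validity against a DTD or schema.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None
```

Read the output essay file into a string and pass it to well_formedness_error. A result other than None gives the line and column where parsing failed, which is usually the place to look first in the editor.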
- Applying THL Word Styles to an Essay: Pasting the text into a Word document and applying THL styles.
- Catalog Record for THL Essays: Adding and filling out metadata table.
- Creating a Glossary Table
- Converting THL Essay to XML: Converting the Word document to an XML document
- Opening THL Essay in XML Editor. Cleaning it up and adding markup as necessary to achieve a valid XML document.
- Posting THL XML Essay: Posting the document to the THL site and linking to it.
Each of these steps is described in further detail in the linked sections.
Note: These instructions are written for PCs running Windows (preferably XP but it may work on earlier versions) only.
First you create the Word document in a THL template with the appropriate header and structural styles applied to it. Then you create a separate document with the glossary table in it. The converter is a stand-alone third document that contains a Visual Basic macro. It also uses a Java virtual machine for tasks such as inserting the supplementary information after the first occurrence of a word and sorting the glossary in Tibetan sort order. All of this is run through a Word window, but you must have Java installed on your machine. You press Alt + C, and it asks you to specify the essay document, which you choose from your hard drive using a “choose file” dialog box. Then it asks for your initials and whether the essay has a glossary file. If it has one, it asks you to choose the glossary file from your hard drive using a “choose file” dialog box.
The converter looks at the glossary and starts with the shortest term (so it will start with sangs rather than sangs rgyas); for terms of the same length, it simply starts with the first term it encounters in the file. It then searches through the essay and finds all occurrences of that text string, and uses the “type” of the term to determine which Word character style to apply to it in Word. Once it has done that for each term in the Word glossary, it converts the Word styles into XML.
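The ordering and search described above can be illustrated with a short Python sketch. This is not the converter's code, which is written in Visual Basic for Applications. The glossary rows, the style names in STYLE_FOR_TYPE, and both function names are hypothetical, and the sketch makes no attempt to show how the converter treats overlapping matches, which is not described here.

```python
import re

# Hypothetical glossary rows: (term, item type). Real rows carry more columns.
GLOSSARY = [("sangs rgyas", "term"), ("sangs", "term"), ("lha sa", "place")]

# Hypothetical mapping from item type to a Word character style name.
STYLE_FOR_TYPE = {"term": "TermStyle", "place": "PlaceStyle"}


def order_terms(glossary):
    """Shortest term first; equal-length terms keep their order in the file.

    sorted() is stable, so ties are left in their original sequence.
    """
    return sorted(glossary, key=lambda row: len(row[0]))


def find_occurrences(text, glossary):
    """List (offset, term, style) for every occurrence of every term."""
    hits = []
    for term, item_type in order_terms(glossary):
        style = STYLE_FOR_TYPE[item_type]
        # re.escape treats the term as a literal string, not a pattern.
        for match in re.finditer(re.escape(term), text):
            hits.append((match.start(), term, style))
    return hits
```

For the sample text "sangs rgyas went to lha sa", the terms are processed in the order sangs, lha sa, sangs rgyas, and both sangs and sangs rgyas are reported at offset 0.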
- In bibliographic citations, the correct style to use for the pagination (if used at all) is "pages,pgs" and NOT "PageNumber,pgn". The latter is used to insert a <milestone unit="page"/> to denote a page break in a transcription and causes validation errors.
- Author styles all now get converted into <author>. This causes some validation problems because all <author> elements HAVE to be included within a <bibl> element. To that end, the converter now wraps the content of all <note> elements with <bibl>. Thus, it will always be <note><bibl>…</bibl></note> for all notes. But, in notes that do not have bibl information, the <bibl> … </bibl> tags should be removed.
- <author> elements that are included within the body of the text can be dealt with in two different ways:
- If the author is included with at least a title, then all the associated bibliographic information can be wrapped in a <bibl> element.
- If the author is just named and the bibliographic information is in a note, then it is permissible to change it to <persName type="author">.
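The second option can be sketched in Python for files with many such cases. This is an illustration under stated assumptions, not part of the converter: the function name is hypothetical, the elements are assumed to carry no XML namespace, and every <author> outside a <bibl> is treated as a bare mention. Whether a given author should instead be wrapped in <bibl> together with its title remains an editorial decision that the script cannot make.

```python
import xml.etree.ElementTree as ET


def retag_loose_authors(root):
    """Rename each <author> with no <bibl> ancestor to <persName type="author">.

    Modifies the tree in place and returns the number of elements changed.
    """
    changed = 0

    def walk(elem, inside_bibl):
        nonlocal changed
        for child in elem:
            if child.tag == "author" and not inside_bibl:
                child.tag = "persName"
                child.set("type", "author")
                changed += 1
            # Descendants of a <bibl> are left untouched.
            walk(child, inside_bibl or child.tag == "bibl")

    walk(root, root.tag == "bibl")
    return changed
```

Authors inside notes are unaffected, because the converter wraps note content in <bibl>. After running the function, serialize the tree with ET.tostring and re-validate the document in the XML editor.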