Batch Conversion of Input Tibetan Texts
Contributor(s): Than Grove
Input Tibetan texts can be converted from Word into XML using the latest WordToXML converter. The converter runs in batch mode, converting several documents (usually a volume's worth) at once. The procedure itself is simple, but converting a whole volume can take a considerable amount of time. It assumes that you have a volume of input texts that are at least partially marked up; the markup can be minimal. The requirements for the input texts are:
- They are input in Unicode Tibetan.
- They each have a Tibetan Text Metadata table at the top of the document.
- They each have at least an h1 header containing the Tibetan title.
- They have page marker milestones in the form “[1a]”, with the Word style “PageNumber, pgn” applied to each page number.
Other more specific mark-up is not required, but is helpful and therefore recommended. This is covered elsewhere in this documentation.
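The page marker requirement above can be spot-checked on a plain-text copy of a document before conversion. The following is only a minimal sketch: it assumes that every marker is a folio number followed by the side letter “a” or “b”, and that markers normally run 1a, 1b, 2a, and so on. The function names are hypothetical and are not part of the converter.

```python
import re

# Assumption: a page marker is "[", a folio number, a side letter (a or b), "]".
PAGE_MARKER = re.compile(r"\[(\d+)([ab])\]")


def find_page_markers(text):
    """Return the page markers found in text, in order, as (folio, side) pairs."""
    return [(int(m.group(1)), m.group(2)) for m in PAGE_MARKER.finditer(text)]


def out_of_sequence(markers):
    """Return the markers that do not directly follow their predecessor.

    Assumption: after side "a" comes side "b" of the same folio, and after
    side "b" comes side "a" of the next folio.
    """
    problems = []
    for prev, cur in zip(markers, markers[1:]):
        folio, side = prev
        expected = (folio, "b") if side == "a" else (folio + 1, "a")
        if cur != expected:
            problems.append(cur)
    return problems
```

A marker reported by `out_of_sequence` is not necessarily an error; it is only a place worth checking by hand against the source text.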
The process for converting Tibetan texts is as follows:
- Download the latest WordToXML converter
- Unzip it to a convenient place on your hard drive. This will create a folder called “w2xmlconv-v1.3.2”, though the version number may be different.
- This folder will contain three folders: “in”, “out”, and “lib”. In the “in” folder, place all the Word documents of the input Tibetan texts that you want converted. It is recommended that you convert no more than one volume's worth of texts at a time. Volumes with more than thirty texts may themselves need to be broken into separate batches.
- Once all the Word docs are in the “in” folder, open the document called “thl-word2xml-conv-v1.3.2.doc”.
- Make sure macros are enabled and allowed to run.
- Press Alt + B (for batch). The converter will then run on its own and convert all the documents in the “in” folder.
- After conversion, open the “out” folder and check each XML document by opening it in oXygen and making sure that it validates, fixing any errors discovered.
- To view a single XML file that has been concatenated from multiple Word docs, on Windows you need to set jEdit's global encoding option to UTF-8Y.
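Before opening each converted file in the editor, a quick pass over the “out” folder can show which files need attention first. The sketch below is an assumption-laden helper, not part of the converter: it uses Python's standard `xml.etree.ElementTree` parser, which checks only well-formedness (for example, unmatched tags) and does not validate against a DTD or schema, so it does not replace the validation step. It also assumes the output files have the extension `.xml`.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed(out_dir):
    """Return (file name, parser message) pairs for files that fail to parse."""
    problems = []
    for path in sorted(Path(out_dir).glob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            # The message includes the line and column of the problem.
            problems.append((path.name, str(err)))
    return problems


if __name__ == "__main__":
    # Usage: python check_out.py path/to/out   (script name is hypothetical)
    folder = sys.argv[1] if len(sys.argv) > 1 else "out"
    for name, message in find_malformed(folder):
        print(f"{name}: {message}")
```

Files that pass this check may still be invalid against the document's schema, so they should still be validated as described in the steps above.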
If a page number milestone falls within a character style such as title or name person (this does not apply to paragraph styles), the element representing the surrounding character style is repeated after the milestone. The result is invalid markup that looks something like this:
བོད་སྐད་དུ། <title lang="tib" level="m">འཇམ་དཔལ་ཡེ་ཤེས་སེམས་དཔའི་དོན་དམ་པའི་ མཚན་ཡང་དག་པར་བརྗོད་<milestone unit="line" n="1a.2"/><title level="m">པ</title>༑
Here, the opening <title> tag is erroneously repeated after the <milestone> tag. This makes the document invalid, and in oXygen you will receive the following error message:
The element type "title" must be terminated by the matching end-tag "</title>".
Double-clicking on the error message at the bottom of the oXygen window will take you to the vicinity of the fault. The second opening <title> tag needs to be deleted for the document to validate. The correct markup looks as follows:
བོད་སྐད་དུ། <title lang="tib" level="m">འཇམ་དཔལ་ཡེ་ཤེས་སེམས་དཔའི་དོན་དམ་པའི་ མཚན་ཡང་དག་པར་བརྗོད་<milestone unit="line" n="1a.2"/>པ</title>༑
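When many files in a batch show this fault, the manual fix can be approximated with a regular expression. The following is a minimal sketch under stated assumptions, not a feature of the converter: it handles only the simple case shown above, where nothing but text lies between the original opening tag and the milestone, and the function name `drop_repeated_open_tag` is hypothetical.

```python
import re

# Matches an opening tag, text containing no other tags, a <milestone/>,
# and then a second opening tag with the same element name as the first.
REPEATED_OPEN = re.compile(
    r"(<(?P<name>[A-Za-z][\w.-]*)\b[^<>]*(?<!/)>"  # first opening tag, not self-closing
    r"[^<>]*"                                       # text only, no nested tags
    r"<milestone\b[^<>]*/>)"                        # the milestone
    r"<(?P=name)\b[^<>]*>"                          # the repeated opening tag
)


def drop_repeated_open_tag(xml_text):
    """Delete an opening tag that is repeated directly after a milestone."""
    return REPEATED_OPEN.sub(r"\1", xml_text)
```

Because a regular expression cannot tell whether a second, nested element of the same name was intended, each change should be reviewed and the file validated again afterwards.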