Introduction to XML

XML (Extensible Markup Language) is not a programming language but a markup language, very similar in principle to HTML. This means that the transcription document is "marked up" with code written into the body of the document.

XML uses code "tags" to define sections of text that are to be handled by the software in various ways. Tags are enclosed in angle brackets, and each one begins with an element name that defines what the tag's function will be.

Tags need to be closed to define the extent of their impact on the text, and must be closed in the reverse of the order in which they were opened: the most recently opened tag is closed first. Since many tags exist as "child" tags or subdivisions, nested within other tags, the order of opening and closing is especially important. A tag is closed either by placing a closing tag (the element name preceded by "/") at the end of the section of text being described or, for a tag that encloses no text, by placing a "/" at the end of the initial tag. See examples below.
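
As a small illustration of nesting and closing, consider the fragment below. It is a sketch only: <div type="canon"> and <p> come from this tutorial, but the sample wording and the empty <lb/> tag (used here as a line-break marker) are assumptions chosen for the example, not confirmed CCL practice.

```xml
<!-- Correct: <p> was opened last, so it is closed first. -->
<div type="canon">
  <!-- <lb/> encloses no text, so it is closed with "/" inside the initial tag. -->
  <p>Text of the canon, first line<lb/>second line.</p>
</div>

<!-- Incorrect: <div> is closed while <p> is still open. -->
<!-- <div type="canon"><p>Text of the canon.</div></p> -->
```

An XML editor will reject the second form as not well-formed, because the closing tags overlap instead of nesting.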

A single tag can carry one or more attributes in addition to its element name. A tag may be as simple as <p> to create a paragraph, or as elaborate as <div type="X">, where the attribute "type" can take a number of different values, all dependent on this tag's element "div". In this tutorial, attributes are written with an @ sign, so @type refers to the attribute type="" within a tag, where the value is contained within the quotation marks.
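
To make the terminology concrete, here is a sketch of one tag with an element name and two attributes. The value "canon" appears elsewhere in this tutorial; the n attribute and its value are hypothetical, included only to show a second attribute on the same tag.

```xml
<!-- Element name: div. Attributes: type and n, each with a quoted value. -->
<!-- In this tutorial's notation these attributes would be called @type and @n. -->
<div type="canon" n="3">
  <!-- A tag with an element name only and no attributes. -->
  <p>Text of the canon.</p>
</div>
```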

One attribute that will recur frequently in an XML document is @xml:id (unique identifier). This attribute is crucial for the software to function correctly. It allows individual canons, notes, prefaces, and creeds to be identified specifically and uniquely, which permits a search function to locate and display the particular item.

Unique identifiers are assigned to the following tags: <text type="collection">, <text type="register">, <text type="book">, <text type="part">, <text type="title">, <text type="preface">, <text type="creed">, <text type="council">, <text type="decretal">, <text type="paracanonical">, <text type="floatingText">, <div type="canon">, <div type="regCanon">, <seg>, and <note>. As a general rule, the @xml:id for all <div> and <text> tags is formed from three parts separated by periods: [siglum of the manuscript.folio number.listing number]. In other words, the siglum of the manuscript, the folio number, and the listing number of the object on the page (third from the top, eighth from the top, seventeenth from the top, etc.) are all required to create the @xml:id for that object. For the <note> and <seg> tags, a hyphen and a number (-1, -5, etc.) should be added to the end of the @xml:id created for the portion of text they reference, since these tags are dependent on that text.
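
The pattern can be sketched as follows. Every value here is hypothetical: the siglum "K1", the folio "12r", and the listing number 3 are invented for illustration, and the exact way folios are written is an assumption. The placement of <seg> and <note> inside the canon, and the choice to number them in one shared sequence, are also assumptions; what is certain is only that no two @xml:id values in a document may be identical.

```xml
<!-- Hypothetical values: siglum K1, folio 12r, third object on the page. -->
<div type="canon" xml:id="K1.12r.3">
  <!-- The seg depends on the canon: the canon's xml:id plus a hyphen and number. -->
  <p>Text of the canon, with <seg xml:id="K1.12r.3-1">a marked phrase</seg>.</p>
  <!-- The note uses a different number so that its xml:id stays unique. -->
  <note xml:id="K1.12r.3-2">A note on this canon.</note>
</div>
```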

See below for examples of xml:ids that have been created in the usual way. A slightly more complex use of the @xml:id is when notes are attached to lemmata in the text. See here for further discussion.

If you want to write something within an XML document that will not be visible to the search engine and will not be treated as active code, place the text inside a comment: <!--comment goes here -->. Comments can be used for remarks about the manuscript, notes or suggestions for the editor or reader, or any other purpose.
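
A brief sketch of comments in context follows; the wording of the remarks is invented for illustration. One rule worth knowing is that XML does not permit two hyphens in a row inside a comment, so dashes in remarks should be avoided or written another way.

```xml
<div type="canon">
  <!-- Transcriber's remark: the ink is faded here; reading uncertain. -->
  <p>Text of the canon.</p>
  <!-- A comment may also run
       over several lines. -->
</div>
```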

There are XML-editing programs, such as oXygen, that can assist in writing XML code for an existing transcription. An XML editor will allow a novice coder to write faster and more accurate code, as it will predict the next needed tag, provide error-checking, and structure a document for more intuitive comprehension. However, XML can be written in any plain-text editor, such as Notepad.

XML must validate against a schema, which sets the parameters for correct code. The CCL schema is located at http://ccl.rch.uky.edu/XML/CCL.rng. This address can be entered into your XML-editing software, which will perform the validation steps for you. If you are writing in a plain-text editor, the validation process will be much more complex. Please contact the CCL in this case for more information.
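
One common way to attach a schema to a file, recognized by oXygen and other XML editors, is an xml-model processing instruction at the top of the document. The sketch below assumes that the CCL schema is a RELAX NG schema (as its .rng extension suggests) and uses a TEI root element as a placeholder. Both points are assumptions, and the CCL may prescribe a different method, so treat this as an illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Tells the editor which schema to validate this document against. -->
<?xml-model href="http://ccl.rch.uky.edu/XML/CCL.rng"
            type="application/xml"
            schematypens="http://relaxng.org/ns/structure/1.0"?>
<!-- Placeholder root element; the real root is whatever the CCL schema requires. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- transcription goes here -->
</TEI>
```

With a line like this in place, an editor that supports xml-model can validate the file as you type, without the schema address being entered separately for each document.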