A markup language is an artificial language using a set of annotations to text that give instructions regarding the structure of text or how it is to be displayed. Markup languages have been in use for centuries, and in recent years have also been used in computer typesetting and word-processing systems.
A well-known example of a markup language in use today in computing is HyperText Markup Language (HTML), one of the most used on the World Wide Web. HTML follows some of the markup conventions used in the publishing industry in the communication of printed work among authors, editors, and printers.
The details of the early history of descriptive markup languages are hotly debated. However, it is clear that the notion was independently discovered several times throughout the 1970s (and possibly the late 1960s), and became an important practice in the late 1980s.
Some early examples of markup languages available outside the publishing industry can be found in typesetting tools on Unix systems such as troff and nroff. In these systems, formatting commands were inserted into the document text so that typesetting software could format the text according to the editor's specifications. Getting a document to print correctly was an iterative, trial-and-error process. The availability of WYSIWYG ("what you see is what you get") publishing software supplanted much use of these languages among casual users, though serious publishing work still uses markup to specify the non-visual structure of texts.
In the early 1980s, the idea that markup should be focused on the structural aspects of a document and leave the visual presentation of that structure to the interpreter led to the creation of SGML. The language was developed by a committee chaired by Charles Goldfarb. It incorporated ideas from many different sources, including William W. Tunnicliffe's project, GenCode. Sharon Adler, Anders Berglund, and James A. Marke were also key members of the SGML committee.
SGML specified a syntax for including the markup in documents, as well as one for separately describing what tags were allowed, and where (the Document Type Definition (DTD) or schema). This allowed authors to create and use any markup they wished, selecting tags that made the most sense to them and were named in their own natural languages. Thus, SGML is properly a meta-language, and many particular markup languages are derived from it. From the late 1980s on, most substantial new markup languages have been based on the SGML system, including for example TEI and DocBook. SGML was promulgated as an International Standard by the International Organization for Standardization, ISO 8879, in 1986.
SGML found wide acceptance and use in fields with very large-scale documentation requirements. However, it was generally found to be cumbersome and difficult to learn, a side effect of attempting to do too much and be too flexible. For example, SGML made end-tags (or start-tags, or even both) optional in certain contexts, because it was thought that markup would be done manually by overworked support staff who would appreciate saving keystrokes.
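To illustrate what optional end-tags demand of a parser, the following minimal sketch infers the omitted end-tags in a short list. It is not an SGML parser: a real one derives the omission rules from the DTD, whereas here a single rule is hard-coded in the hypothetical `IMPLIED_END` set, and the class name `EndTagInferrer` is likewise invented for this example. It uses Python's standard `html.parser` module only as a tokenizer.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for DTD knowledge: an open element of one of these
# types is closed implicitly when a start-tag of the same name arrives.
IMPLIED_END = {"li", "p"}


class EndTagInferrer(HTMLParser):
    """Record start, explicit end, and inferred end events for elements."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.events = []      # (kind, tag) tuples in document order

    def handle_starttag(self, tag, attrs):
        # A new <li> while an <li> is still open implies the old one ended.
        if tag in IMPLIED_END and self.open_tags and self.open_tags[-1] == tag:
            self._close(implied=True)
        self.open_tags.append(tag)
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end-tag: ignore it in this sketch
        # Children that are still open had their end-tags omitted.
        while self.open_tags[-1] != tag:
            self._close(implied=True)
        self._close(implied=False)

    def _close(self, implied):
        tag = self.open_tags.pop()
        self.events.append(("implied-end" if implied else "end", tag))


parser = EndTagInferrer()
parser.feed("<ul><li>one<li>two</ul>")
parser.close()
for event in parser.events:
    print(event)
```

The author types two start-tags and no end-tags for the list items, and the parser has to reconstruct the rest. That extra work is one reason SGML tools were harder to implement.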
The situation changed when Sir Tim Berners-Lee, learning of SGML from co-worker Anders Berglund and others at CERN, used SGML syntax to create HTML. HTML resembles other SGML-based tag languages, although it began as simpler than most and a formal DTD was not developed until later. DeRose argues that HTML's use of descriptive markup (and SGML in particular) was a major factor in the success of the Web, because of the flexibility and extensibility that it enabled (other factors include the notion of URLs and the free distribution of browsers). HTML is quite likely the most used markup language in the world today.
However, HTML's status as a markup language is disputed by some computer scientists. The argument for this is that HTML restricts the placement of tags, requiring them to be either fully nested inside other tags, or the root tag of the document. Because of this, these scientists would suggest instead that HTML is a container language, following a hierarchical model.
XML (Extensible Markup Language) is a meta markup language that is now widely used. XML was developed by the World Wide Web Consortium, in a committee created and chaired by Jon Bosak. The main purpose of XML was to simplify SGML by focusing on a particular problem: documents on the Internet. XML remains a meta-language like SGML, allowing users to create any tags needed (hence "extensible") and then describing those tags and their permitted uses.
XML adoption was helped because every XML document can be written in such a way that it is also an SGML document, and existing SGML users and software could switch to XML fairly easily. However, XML eliminated many of the more complex and human-oriented features of SGML to simplify implementation (while increasing markup size and reducing readability and editability). Other improvements rectified some SGML problems in international settings, and made it possible to parse and interpret document hierarchy even if no DTD is available.
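The point about recovering the document hierarchy without a DTD can be sketched with Python's standard `xml.etree.ElementTree` module. The tag names in the sample document (`recipe`, `title`, `ingredient`) and the helper `outline` are invented for this illustration; no DTD or schema is declared anywhere.

```python
import xml.etree.ElementTree as ET

# A document using made-up tags; nothing tells the parser what they mean.
doc = """<recipe>
  <title>Flatbread</title>
  <ingredient amount="500" unit="g">flour</ingredient>
  <ingredient amount="300" unit="ml">water</ingredient>
</recipe>"""


def outline(element, depth=0):
    """Return the element hierarchy as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(doc)
hierarchy = outline(root)
print("\n".join(hierarchy))
```

Because every element is explicitly opened and closed, the nesting can be read off the syntax alone. The parser never needs to know which tags exist or where they are permitted.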
XML was designed primarily for semi-structured environments such as documents and publications. However, it appeared to hit a sweet spot between simplicity and flexibility, and was rapidly adopted for many other uses. XML is now widely used for communicating data between applications. Like HTML, it can be described as a 'container' language.
One of the most noticeable differences between HTML and XHTML is the rule that all tags must be closed: empty HTML tags such as <br> must either be closed with a regular end-tag, or replaced by a special form: <br /> (the space before the '/' at the end of the tag is optional, but frequently used because it enables some pre-XML Web browsers, and SGML parsers, to accept the tag). Another is that all attribute values in tags must be quoted. Finally, all tag and attribute names must be lowercase in order to be valid; HTML, on the other hand, was case-insensitive.
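The first two rules can be checked with any XML parser, since they are matters of XML well-formedness. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` and the sample fragments are invented for this example. It does not check the lowercase rule, which is a validity constraint of the XHTML document type rather than of XML syntax.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


samples = {
    "unclosed empty tag": "<p>line one<br>line two</p>",
    "self-closing form": "<p>line one<br />line two</p>",
    "regular end-tag": "<p>line one<br></br>line two</p>",
    "unquoted attribute": "<p class=intro>text</p>",
    "quoted attribute": '<p class="intro">text</p>',
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

An HTML browser would accept all five fragments. An XML parser rejects the unclosed tag and the unquoted attribute value outright.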
<h1> Anatidae </h1>
<p>
The family <i>Anatidae</i> includes ducks, geese, and swans,
but <em>not</em> the closely-related screamers.
</p>
The codes enclosed in angle-brackets <like this> are markup instructions (known as tags), while the text between these instructions is the actual text of the document. The codes h1, p, and em are examples of structural markup, in that they describe the intended purpose or meaning of the text they include. Specifically, h1 means "this is a first-level heading", p means "this is a paragraph", and em means "this is an emphasized word or phrase". A program interpreting such structural markup may apply its own rules or styles for presenting the various pieces of text, using different typefaces, boldness, font size, indentation, colour, or other styles, as desired.
A tag such as "h1" (header level 1) might be presented in a large bold sans-serif typeface, for example, or in a monospaced (typewriter-style) document it might be underscored – or it might not change the presentation at all.
In contrast, the i tag in HTML is an example of presentational markup; it is generally used to specify a particular characteristic of the text (in this case, the use of an italic typeface) without specifying the reason for that appearance.
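The idea that an interpreting program applies its own presentation rules to structural tags can be sketched as follows. This is a minimal illustration, not a description of how any browser works: the class `StructuralRenderer`, the function `render`, and the two style tables are all invented for this example, and only the h1, p, and em tags are handled.

```python
from html.parser import HTMLParser


class StructuralRenderer(HTMLParser):
    """Render text by applying a caller-supplied style to each open tag."""

    def __init__(self, styles):
        super().__init__()
        self.styles = styles    # maps tag name -> function(text) -> text
        self.open_tags = []     # stack of currently open element names
        self.output = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        if tag in ("h1", "p"):
            self.output.append("\n")  # headings and paragraphs end a line

    def handle_data(self, data):
        # Apply styles from the innermost open tag outwards.
        for tag in reversed(self.open_tags):
            style = self.styles.get(tag)
            if style is not None:
                data = style(data)
        self.output.append(data)


def render(source, styles):
    renderer = StructuralRenderer(styles)
    renderer.feed(source)
    renderer.close()
    return "".join(renderer.output)


source = "<h1>Anatidae</h1><p>Ducks, but <em>not</em> screamers.</p>"

# Two different presentations of the same structural markup.
shouting = {"h1": str.upper, "em": lambda t: "*" + t + "*"}
underlined = {"h1": lambda t: t + "\n" + "=" * len(t), "em": lambda t: "_" + t + "_"}

print(render(source, shouting))
print(render(source, underlined))
```

The same marked-up source yields two different presentations, because the tags say what the text is rather than how it should look. A presentational tag such as i leaves no such freedom, since the appearance is the only information it carries.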
The Text Encoding Initiative (TEI) has published extensive guidelines for how to encode texts of interest in the humanities and social sciences, developed through years of international cooperative work. These guidelines are used by projects encoding historical documents, the works of particular scholars, periods, or genres, and so on.