O Markup Language

This document specifies the O Markup Language (OML). The recommended file extension is .oml. Documents may be written in any character encoding that includes the ASCII symbol characters described below. Implementations are recommended to support at least UTF-8.

A represented document consists of a sequence of nodes. A node is either text or an element. An element has a label and a sequence of zero or more child nodes.

The ASCII symbol characters ( < [ { are called left beaks, and ) > ] } are called right beaks. Each left beak pairs with the right beak of the same shape: ( with ), < with >, [ with ], and { with }. A sequence of one or two of the remaining 24 ASCII symbol characters !"#$%&'*+,-./:;=?@\^_`|~ is called an eye. A two-character eye takes precedence over a one-character eye. A sequence of one or more ASCII whitespace characters, namely horizontal tab (HT, 0x09), line feed (LF, 0x0A), vertical tab (VT, 0x0B), form feed (FF, 0x0C), carriage return (CR, 0x0D), and space (SP, 0x20), is called a cheek. Note that all null characters in the input are removed prior to parsing.
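
The character classes above can be written down directly. The following is a minimal Python sketch and is not taken from the reference implementation; the names LEFT_BEAKS, RIGHT_BEAKS, BEAK_PAIRS, EYE_CHARS, CHEEK_CHARS, and strip_nulls are hypothetical and chosen only for illustration.

```python
import string

# Left beaks and the right beaks they pair with, position by position.
LEFT_BEAKS = "(<[{"
RIGHT_BEAKS = ")>]}"
BEAK_PAIRS = dict(zip(LEFT_BEAKS, RIGHT_BEAKS))

# Eye characters: every ASCII symbol character that is not a beak.
EYE_CHARS = "".join(
    c for c in string.punctuation if c not in LEFT_BEAKS + RIGHT_BEAKS
)

# Cheek characters: HT, LF, VT, FF, CR, SP.
CHEEK_CHARS = "\t\n\x0b\x0c\r "


def strip_nulls(source: str) -> str:
    """Remove all null characters, as is done prior to parsing."""
    return source.replace("\x00", "")
```

Deriving EYE_CHARS from string.punctuation rather than typing the 24 characters by hand makes it easy to check that no symbol character was missed.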

Parsing proceeds sequentially. Throughout the following, anything described as "potential" becomes text if it does not ultimately become an element. An occurrence of a left beak, an eye, and an optional cheek, in that order, is a potential left head. An occurrence of an optional cheek, an eye, and a right beak, in that order, attempts to create an element as described below. All other occurrences are text.

Some left heads have a corresponding label. The set of such mappings is called the vocabulary. A special element that modifies the vocabulary is called a vocabulary change. Its default left head is <!. A left head carries its label if one is assigned by the vocabulary.

An element is created when a preceding left head exists that has the same kind of beak and a matching eye; the nearest such left head is used. Eyes match when they have the same number of characters, and either each has one character and the two are the same, or each has two characters and the closing eye is the mirror image of the left head's eye (its left character is the left head's right character, and its right character is the left head's left character). When the condition is met, the sequence between the left head and the closing eye is handled as follows: if the left head's label denotes a vocabulary change, the processing described below is performed; if the left head's label is non-empty, an element with that sequence as its children is created; and if the left head's label is empty and it is a direct child of a potential vocabulary change, it becomes a potential element.
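
The matching rule can be expressed compactly. The sketch below is illustrative only; eyes_match, heads_match, and BEAK_PAIRS are hypothetical names. It relies on the observation that a one-character eye is its own reverse, so both cases of the rule reduce to comparing the closing eye with the reversed opening eye.

```python
# Each left beak and the right beak of the same kind.
BEAK_PAIRS = {"(": ")", "<": ">", "[": "]", "{": "}"}


def eyes_match(left_eye: str, right_eye: str) -> bool:
    """Return True if right_eye closes left_eye.

    A one-character eye must be repeated unchanged; a two-character eye
    must appear mirrored. Reversing the left eye covers both cases.
    """
    return len(left_eye) in (1, 2) and right_eye == left_eye[::-1]


def heads_match(left_beak: str, left_eye: str,
                right_eye: str, right_beak: str) -> bool:
    """Return True if the beaks are of the same kind and the eyes match."""
    return (BEAK_PAIRS.get(left_beak) == right_beak
            and eyes_match(left_eye, right_eye))
```

For instance, heads_match("(", ":~", "~:", ")") holds, while heads_match("(", ":~", ":~", ")") does not, because the two-character eye is not mirrored.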

In processing a vocabulary change, the following operation is performed at most once, where applicable, for each item in its content sequence. If the item is an element, its content sequence is examined: if that content is text, the text becomes the label mapped to the element's left head; if it is a left head, the mapping for that left head is transferred to the element's left head.
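
One possible reading of this processing is sketched below in Python. It is an interpretation, not the reference implementation, and it makes several assumptions: the vocabulary is a dictionary keyed by left head written as beak plus eye (for example "(*"), the vocabulary-change label is a hypothetical sentinel object, each element item is given as a pair of its left head and its content as a string, and transferring from a left head that has no mapping leaves the vocabulary unchanged.

```python
LEFT_BEAKS = "(<[{"
EYE_CHARS = "!\"#$%&'*+,-./:;=?@\\^_`|~"

# Hypothetical sentinel standing for the vocabulary-change label.
VOCABULARY_CHANGE = object()


def is_left_head(content: str) -> bool:
    """True if content is exactly a left beak plus a one- or two-character eye."""
    return (
        len(content) in (2, 3)
        and content[0] in LEFT_BEAKS
        and all(c in EYE_CHARS for c in content[1:])
    )


def apply_vocabulary_change(vocabulary: dict, items: list) -> dict:
    """Apply each (left_head, content) element item in order, in place."""
    for head, content in items:
        if is_left_head(content):
            # Transfer: the mapping moves, so the original entry is removed.
            # Assumption: nothing happens if the inner head has no mapping.
            if content in vocabulary:
                vocabulary[head] = vocabulary.pop(content)
        else:
            # Text content becomes the label of the element's left head.
            vocabulary[head] = content
    return vocabulary
```

Starting from {"<!": VOCABULARY_CHANGE}, the items [("(+", "a"), ("(*", "(+")] leave "(*" mapped to "a" and "(+" unmapped, which agrees with Case 6 in the examples.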

The language is now fully specified. The remainder of this document is non-normative.

Features and Notes

This section briefly outlines the features and notable points that follow from the specification above. This language focuses on ease of memorizing the syntax, ease of implementation, and extensibility. It also aims to be independent of the natural language and text encoding used for authoring. Consequently, for the commonly used UTF-8, an implementation providing very basic functionality should not be difficult even in C.

There are no escape characters such as those found in other markup languages. For example, suppose you want to write the default head of a vocabulary change, <!, literally. The head can be neutralized by first using a vocabulary change to transfer the vocabulary-change label to a different head (as shown in the examples below).

Any string can be parsed as a valid document; in other words, there are no syntax errors. No vocabulary is predefined except for the vocabulary change itself, which is central to the syntax. Elements such as headings, emphasis, and lists are expected to be defined by the user, who then processes the parsed result as needed.

Formalization of Lexical and Syntactic Structure

This section provides a rough formalization of part of the lexical and syntactic structure. Note that vocabulary-related resolution and the specific processing of vocabulary changes cannot be expressed by formal grammar alone and are therefore not included here. Accordingly, be aware that some results parsed according to the Extended Backus–Naur Form (EBNF) below may ultimately be treated as text.

Document  ::= Node*
Node      ::= TEXT | Element

LBeak     ::= "(" | "<" | "[" | "{"
RBeak     ::= ")" | ">" | "]" | "}"
EyeChar   ::= "!"  | '"' | "#" | "$" | "%" | "&"
            | "'"  | "*" | "+" | "," | "-" | "."
            | "/"  | ":" | ";" | "=" | "?" | "@"
            | "\\" | "^" | "_" | "`" | "|" | "~"
Eye       ::= EyeChar | EyeChar EyeChar
Cheek     ::= (HT | LF | VT | FF | CR | SP)+

Element   ::= LeftHead Content RightHead
LeftHead  ::= LBeak Eye [Cheek]
RightHead ::= [Cheek] Eye RBeak
Content   ::= Node*

Implementation Guidelines

The following is an example algorithm for implementing a parser. Any approach that conforms to the normative content above is acceptable, even if it differs from what is described here. The reference implementation handles the stack somewhat differently, but the overall flow is the same. The general approach is stack-based sequential parsing --- that is, a scan from the beginning to the end of the input string.

The input is preprocessed. Null characters are removed. For Unicode input, normalization is recommended. Normalizing line endings to LF is also recommended, though leaving them as CRLF causes no practical issues. Removing the byte order mark (BOM) is recommended as well. An input that is empty at this point represents an empty document.
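
The preprocessing steps can be sketched in Python as follows. The function name preprocess and its form parameter are hypothetical. The text above does not name a Unicode normalization form, so the NFC default here is an assumption, and only CRLF sequences are rewritten to LF.

```python
import unicodedata


def preprocess(source: str, form: str = "NFC") -> str:
    """Prepare decoded input text for lexical analysis."""
    # Null characters are always removed.
    text = source.replace("\x00", "")
    # Remove a leading byte order mark, if present.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Normalize CRLF line endings to LF.
    text = text.replace("\r\n", "\n")
    # Assumption: NFC is used because no normalization form is specified.
    return unicodedata.normalize(form, text)
```

An empty result, for example from an input consisting only of a BOM and null characters, represents an empty document.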

Lexical analysis is performed. At each position during the scan, the longest match is used to produce a token of the following kinds: LBeak, RBeak, Eye (preferring length 2), Cheek, and TEXT for any sequence of other characters.
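
A hand-written lexer along these lines might look as follows. This is a sketch that follows the guideline literally, scanning left to right and taking a two-character eye whenever two eye characters are adjacent; the function names and the (kind, text) token representation are hypothetical.

```python
LEFT_BEAKS = "(<[{"
RIGHT_BEAKS = ")>]}"
EYE_CHARS = "!\"#$%&'*+,-./:;=?@\\^_`|~"
CHEEK_CHARS = "\t\n\x0b\x0c\r "
SPECIAL = LEFT_BEAKS + RIGHT_BEAKS + EYE_CHARS + CHEEK_CHARS


def run_end(source: str, start: int, chars: str, inside: bool = True) -> int:
    """Return the end of the run at start of characters in (or not in) chars."""
    end = start
    while end < len(source) and (source[end] in chars) == inside:
        end += 1
    return end


def tokenize(source: str) -> list:
    """Split source into (kind, text) tokens; joining the texts restores it."""
    tokens = []
    i = 0
    while i < len(source):
        c = source[i]
        if c in LEFT_BEAKS:
            kind, end = "LBEAK", i + 1
        elif c in RIGHT_BEAKS:
            kind, end = "RBEAK", i + 1
        elif c in EYE_CHARS:
            # Prefer a two-character eye when the next character allows it.
            two = i + 1 < len(source) and source[i + 1] in EYE_CHARS
            kind, end = "EYE", i + (2 if two else 1)
        elif c in CHEEK_CHARS:
            kind, end = "CHEEK", run_end(source, i, CHEEK_CHARS)
        else:
            kind, end = "TEXT", run_end(source, i, SPECIAL, inside=False)
        tokens.append((kind, source[i:end]))
        i = end
    return tokens
```

Because every character ends up in exactly one token, tokens that are later found not to belong to any head can be turned back into text without loss.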

Parsing is performed. An empty stack is prepared to hold left head candidates. The token sequence is read from the beginning and processed as follows.

When an LBeak appears at some position in the token sequence and is immediately followed by an Eye, that position is pushed onto the stack as a potential left head (LeftHead). The information stored is the kind of LBeak, the Eye string, and the position of the left head. Note that a Cheek may follow, and that it becomes part of the left head once the left head is confirmed.

When an RBeak appears immediately after an Eye at some position in the token sequence, the stack is searched from the top --- that is, from the most recent left head --- for the first left head whose beak kind forms a matching pair and whose eye matches. Processing then proceeds as specified above. For example, if an element is to be created, the cheek is excluded, the token sequence up to just before this closing eye is taken as Content, and the resulting child node sequence becomes the element's children. The matching left head is then removed from the stack. If no matching left head is found, the Eye and RBeak are treated as text and accumulated as TEXT.
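
The stack search can be sketched as below. This Python fragment covers only the structural matching of heads and ignores the vocabulary entirely, so it reports candidate pairs rather than elements. It takes (kind, text) tokens of the kinds named above; the function name match_heads is hypothetical. Two points are assumptions not stated in the text: left heads skipped during the search are discarded together with the matched one, and a single eye token is not allowed to serve as both the opening and the closing eye.

```python
# Each left beak and the right beak of the same kind.
BEAK_PAIRS = {"(": ")", "<": ">", "[": "]", "{": "}"}


def match_heads(tokens: list) -> list:
    """Return (left_beak_index, right_beak_index) pairs of matching heads."""
    stack = []  # entries: (beak, eye, index of the LBEAK token)
    pairs = []
    for i in range(len(tokens) - 1):
        kind, value = tokens[i]
        next_kind, next_value = tokens[i + 1]
        if kind == "LBEAK" and next_kind == "EYE":
            # Potential left head: remember beak, eye, and position.
            stack.append((value, next_value, i))
        elif kind == "EYE" and next_kind == "RBEAK":
            # Search from the most recent left head downwards.
            for depth in range(len(stack) - 1, -1, -1):
                beak, eye, start = stack[depth]
                if start + 1 == i:
                    continue  # assumption: one eye cannot open and close
                if BEAK_PAIRS[beak] == next_value and eye[::-1] == value:
                    pairs.append((start, i + 1))
                    # Assumption: skipped heads above the match are dropped.
                    del stack[depth:]
                    break
    return pairs
```

For the tokens of (+ (* +) the inner (* is skipped and the outer pair is reported, which is the structure behind Case 8 in the examples.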

Once parsing is complete, any potential left heads remaining on the stack are treated as unclosed and are finally converted to text. Note that vocabulary changes must be accounted for separately in the processing described above, as they may cause certain items to be recognized as elements or to cease being recognized as such.

Examples

This section presents examples. Each example is accompanied by its JSON output, enabling conformance testing against this document. The JSON output follows the JSON schema below, which represents the abstract syntax tree. This is the format produced by the reference implementation's executable, though conforming implementations are not required to produce this exact structure.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "urn:local:oml-ast.schema.json",
  "title": "Sample AST",
  "type": "array",
  "items": { "$ref": "#/$defs/node" },
  "$defs": {
    "node": {
      "anyOf": [
        { "type": "string" },
        { "$ref": "#/$defs/element" }
      ]
    },
    "element": {
      "type": "object",
      "properties": {
        "label": { "type": "string" },
        "children": {
          "type": "array",
          "items": { "$ref": "#/$defs/node" }
        }
      },
      "required": ["label", "children"],
      "additionalProperties": false
    }
  }
}
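
For conformance testing without a JSON Schema library, the constraints of the schema above can be checked by hand. The following Python sketch uses only the standard json module; the function names is_node and is_ast are hypothetical.

```python
import json


def is_node(value) -> bool:
    """A node is a string or an object with exactly 'label' and 'children'."""
    if isinstance(value, str):
        return True
    return (
        isinstance(value, dict)
        and set(value) == {"label", "children"}
        and isinstance(value["label"], str)
        and is_ast(value["children"])
    )


def is_ast(value) -> bool:
    """A document, like an element's children, is an array of nodes."""
    return isinstance(value, list) and all(is_node(item) for item in value)


# Usage: check the expected output of a test case.
expected = json.loads('["(+b+)", {"label": "a", "children": ["c"]}]')
assert is_ast(expected)
```

Requiring the key set to equal exactly label and children mirrors the combination of "required" and "additionalProperties": false in the schema.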

Any string is a valid source.

Case 1:
a
["a"]

The string (C), often used as a copyright symbol, is not an element because it contains no eye. It remains plain text as-is.

Case 2:
(C)
["(C)"]

If the left head has no label in the vocabulary, no element is created.

Case 3:
(+a+)
["(+a+)"]

The minimal vocabulary change is as follows.

Case 4:
<!(*a*)!>(*b*)
[{"label":"a","children":["b"]}]

With a two-character eye, the closing eye is the mirror image of the opening eye.

Case 5:
<!(:~a~:)!>(:~b~:)
[{"label":"a","children":["b"]}]

When a mapping is moved by a vocabulary change, the original mapping is removed.

Case 6:
<!(+a+)(* (+ *)!>(+b+)(*c*)
["(+b+)",{"label":"a","children":["c"]}]

If no element is created, any part that could have been a cheek is preserved.

Case 7:
(+ (* +)
["(+ (* +)"]

If an element is created, the cheek is removed.

Case 8:
<!(+a+)!>(+ (* +)
[{"label":"a","children":["(*"]}]

The vocabulary-change label can itself be transferred to another left head, after which the default head <! is treated as text.

Case 9:
<! <? <! ?> !><? (+a+) ?><!a!>(+b+)
["<!a!>",{"label":"a","children":["b"]}]

Changelog

The first edition was written on May 21, 2023. The design was inspired by TeX, XML, and Djot. On April 27, 2025, the acronym of this language was changed to OML. On May 17, 2026, the eighth edition added the handling of whitespace.

License

Copyright (C) 2023-2026 gemmaro.

Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.  This file is offered as-is,
without any warranty.