
Questions & Answers: HTML, XHTML, XML and Control Codes

Question...

How do HTML, XML, and XHTML support the Control Codes in the C0 (U+0000-U+001F) and C1 (U+0080-U+009F) ranges, and DELETE (U+007F)?

Answer...

Legacy applications sometimes create data that incorporates controls. When migrating these applications or their data to the web, it is therefore important to understand how controls are supported in markup languages.

There are two ranges of the Unicode Character Set that are assigned as Control Codes. The Unicode Standard makes no particular use of these controls and leaves their definition up to the application. If the application does not specify their use, then they are to be interpreted according to the semantics of ISO/IEC 6429. Most of you will recognize many of the 6429 controls: ACK, NAK, BEL, LF, FF, VT, CR, and others. The ISO 8859 family and other character standards base their control codes on the ISO 6429 standard.

The control codes in the range U+0000-U+001F are known as the "C0" range (note 1). This range begins with the NUL control, U+0000. The control codes in the range U+0080-U+009F are known as the "C1" range (note 2). DELETE, U+007F, is also a control and is adjacent to the beginning of the C1 range.
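The ranges described above can be sketched as a small classifier. This is only an illustration; the function name classify_control and its return labels are assumptions made for this example and do not come from any standard.

```python
def classify_control(ch: str) -> str:
    """Classify one character as 'C0', 'DELETE', 'C1', or 'not a control'."""
    cp = ord(ch)
    if cp <= 0x1F:            # U+0000-U+001F
        return "C0"
    if cp == 0x7F:            # DELETE, adjacent to the start of C1
        return "DELETE"
    if 0x80 <= cp <= 0x9F:    # U+0080-U+009F
        return "C1"
    return "not a control"


if __name__ == "__main__":
    for sample in ("\x1b", "\x7f", "\x85", "A"):
        print(f"U+{ord(sample):04X}", classify_control(sample))
```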

A few points are worth noting about controls and markup:

When control codes represent text data other than formatting or binary data, it can be important to maintain their values in context. However, how browsers display most of the controls is unspecified. Maintaining control codes in text is generally more important for data interchange. Programmers working with legacy applications that may have data in the C0 range should be aware of which markup languages support the range.

The following table summarizes which markup languages support the control codes:

Controls and Markup Language
Controls                 Range           HTML 4     XML 1.0    XHTML      XML 1.1
C0, except TAB, LF, CR   U+0000 (NUL)    Illegal    Illegal    Illegal    Illegal
                         U+0001-U+001F   Illegal    Illegal    Illegal    NCR, CER (note 4)
DELETE + C1              U+007F-U+009F   Supported  Supported  Supported  NCR, CER (note 4)
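Before migrating legacy data, it can help to locate the C0 controls that the table marks as illegal in HTML 4, XML 1.0 and XHTML. The following is a minimal sketch; the helper name find_illegal_c0 is an assumption made for this example, and it checks only the C0 range, not every character rule of these languages.

```python
from typing import List, Tuple

# C0 controls that are permitted in markup text: TAB, LF, CR.
ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def find_illegal_c0(text: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXX') for each C0 control other than TAB, LF, CR."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) <= 0x1F and ord(ch) not in ALLOWED_C0
    ]


if __name__ == "__main__":
    # Hypothetical legacy record containing ESCAPE and NUL.
    print(find_illegal_c0("a\x1bb\x00"))
```

An empty result means the text contains no C0 controls that would need one of the solutions described below.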

Solutions

If you need to represent the C0 controls in HTML, XML 1.0 or XHTML, you can create a convention to represent them and replace every occurrence with that convention. An alternative is to encode the data, for example as base64 or as hexadecimal values, to ensure that only supported characters are used in the markup language text. The text must then, of course, be decoded when the files are read. Note that XML Schema provides data types for these encodings.
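As an illustration of the encoding approach, the sketch below wraps text containing C0 controls in base64 inside an XML 1.0 element and decodes it again on reading. The element name payload, the attribute encoding, and the helper names are all assumptions chosen for this example; a real vocabulary would define its own.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_base64(data: str) -> str:
    """Serialize data as base64 text inside a (hypothetical) <payload> element."""
    encoded = base64.b64encode(data.encode("utf-8")).decode("ascii")
    elem = ET.Element("payload", {"encoding": "base64"})
    elem.text = encoded
    return ET.tostring(elem, encoding="unicode")


def unwrap_base64(xml_text: str) -> str:
    """Parse the element and decode its base64 content back to the original text."""
    elem = ET.fromstring(xml_text)
    return base64.b64decode(elem.text or "").decode("utf-8")


if __name__ == "__main__":
    original = "BEL:\x07 ESC:\x1b"          # legacy data with C0 controls
    document = wrap_base64(original)
    print(document)                          # contains only base64 characters
    print(unwrap_base64(document) == original)
```

The cost of this approach is that the encoded content is no longer readable or searchable as text until it has been decoded.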

Another alternative is to store the data in an external document and reference it from the XML document.

In XML 1.1, the simplest alternative is to represent any occurrence of a control with an NCR. For example, the control code "ESCAPE" U+001B would be represented by either the &#x1B; (hexadecimal) or &#27; (decimal) Numeric Character Reference.
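The NCR approach can be sketched as a function that rewrites controls as hexadecimal Numeric Character References, so that ESCAPE U+001B becomes &#x1B;. The function name escape_controls_xml11 is an assumption made for this example. The sketch handles only the control ranges from the table above; it does not escape markup characters such as "&" or "<", and the result is only valid in a document that declares version 1.1.

```python
def escape_controls_xml11(text: str) -> str:
    """Replace C0 (except TAB, LF, CR), DELETE and C1 controls with hex NCRs."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x00:
            # NUL is illegal in XML 1.1 even as a character reference.
            raise ValueError("U+0000 (NUL) cannot be represented in XML 1.1")
        is_c0 = cp <= 0x1F and cp not in (0x09, 0x0A, 0x0D)
        is_del_or_c1 = 0x7F <= cp <= 0x9F
        out.append(f"&#x{cp:X};" if is_c0 or is_del_or_c1 else ch)
    return "".join(out)


if __name__ == "__main__":
    print(escape_controls_xml11("start\x1b[0m end"))
```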

NOTES:

1 More details on the C0 range are available in the Unicode Code Chart: C0 Controls and Basic Latin.

2 More details on the C1 range are available in the Unicode Code Chart: C1 Controls and Latin-1 Supplement.

3 The document Unicode in XML and other Markup Languages contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML.

4 Character Entity Reference (CER) is a term defined in the HTML standard for a named entity that contains a single character. For example, eacute is the Character Entity Reference that represents "é". These Character Entity References are predefined and so are available to all HTML files. XML does not use the term Character Entity Reference, but we use it here to refer to an entity that you might define to represent characters that may be controls.


Authored by Tex Texin and François Yergeau

Version: $Id: qa-forms-utf-8.html,v 1.15 2003/05/12 11:12:20 duerst Exp $