Requirements and normalization of domain names in input
Table of contents
- Objective
- Overview
- Scope
- Inputs
- Summary
- Test procedure
- Outcome(s)
- Special procedural requirements
- Detailed requirements
- Terminology
Objective
This specification defines the requirements for the zone name to be tested. The same requirements apply to any name server names in the input. If the requirements are not met, Zonemaster will not start any tests.
This specification also defines some normalization that the domain names (zone name and name server name) will go through. If a domain name is normalized it means that an updated form of the name will be used. The updated form is considered to be equal in meaning.
In order to execute the tests of the zone name from the input it must be a valid domain name. If name servers are provided for the zone in the input, the names of the name servers must also be valid domain names. Both types of domain names, zone names and name server names, are tested and normalized by this test case. The zone name is called Child Zone in Zonemaster test case specifications.
Overview
To be valid, Domain Name must be one of the following:
- a valid ASCII domain name, or
- a valid IDN name (Internationalized Domain Name) as of IDNA2008.
The process defined in this specification will normalize Domain Name and output a normalized form to be used by all Zonemaster test cases. The objectives of the normalization are to:
- Remove leading and trailing white space characters.
- Convert other dot characters to regular dot (or "FULL STOP").
- Create legal IDNA 2008 U-labels from convenient alternative forms.
- Create consistent representation of the same zone name.
The result of the normalization can be a new form of Domain Name to be used by the tests in test cases, the normalized form. If the normalized form is neither a valid ASCII domain name nor a valid IDN name, then Domain Name cannot be used for Zonemaster testing.
If the outcome (see Outcome(s)) is not "fail" then Domain Name in normalized form is returned to be used as input value for Zonemaster test cases.
See the details in the Detailed requirements section below.
References
The following references are consulted for this specification:
Scope
This specification only tests and creates a normalized form of the domain name (zone name or name server name).
In this specification, ASCII is identical to the first 128 characters in Unicode (0000..007F).
RFC 1123, section 2.1, specifies that a domain name label may not start or end with a HYPHEN-MINUS ("-"), only digit or letter. This restriction on HYPHEN-MINUS is disregarded in this specification and is assumed to be handled in test case Syntax02.
The use of the SOLIDUS ("/") and the LOW LINE ("_") in domain name is discussed in the section "ASCII domain name" below. Any restrictions on where in the domain name or label those could or should be used are disregarded in this specification, and are assumed to be handled in test cases Syntax01 and Syntax02.
Inputs
- "Domain Name" - The domain name to be tested and normalized according to this specification. It must be a non-empty string of Unicode characters.
Summary
In the specification there are eight scenarios that make the domain name unusable for Zonemaster testing. Each scenario is listed here with a message tag, a level (always CRITICAL in this specification), the arguments, if any, used in the message, and a message that can be returned to the user.
Message Tag | Level | Arguments | Message ID for message tag |
---|---|---|---|
AMBIGUOUS_DOWNCASING | CRITICAL | unicode_name | Ambiguous downcasing of character "{unicode_name}" in the domain name. Use all lower case instead. |
DOMAIN_NAME_TOO_LONG | CRITICAL | | Domain name is too long (more than 253 characters with no final dot). |
EMPTY_DOMAIN_NAME | CRITICAL | | Domain name is empty. |
INITIAL_DOT | CRITICAL | | Domain name starts with dot. |
INVALID_ASCII | CRITICAL | label | Domain name has an ASCII label ("{label}") with a character not permitted. |
INVALID_U_LABEL | CRITICAL | label | Domain name has a non-ASCII label ("{label}") which is not a valid U-label. |
LABEL_TOO_LONG | CRITICAL | label | Domain name has a label that is too long (more than 63 characters), "{label}". |
REPEATED_DOTS | CRITICAL | | Domain name has repeated dots. |
The value in the Level column is the default severity level of the message. Also see the Severity Level Definitions document.
The argument names in the Arguments column list the arguments used in the message. The argument names are defined in the argument list.
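As an illustration of how the tags, arguments and messages in the table fit together, the following minimal sketch stores the message templates in a dictionary and fills in the arguments. The names `MESSAGES` and `render_message` are assumptions made for this example and are not part of Zonemaster.

```python
# Message templates copied from the table above. The dictionary and the
# helper function are hypothetical names used only for illustration.
MESSAGES = {
    "AMBIGUOUS_DOWNCASING": 'Ambiguous downcasing of character "{unicode_name}" '
                            "in the domain name. Use all lower case instead.",
    "DOMAIN_NAME_TOO_LONG": "Domain name is too long "
                            "(more than 253 characters with no final dot).",
    "EMPTY_DOMAIN_NAME": "Domain name is empty.",
    "INITIAL_DOT": "Domain name starts with dot.",
    "INVALID_ASCII": 'Domain name has an ASCII label ("{label}") '
                     "with a character not permitted.",
    "INVALID_U_LABEL": 'Domain name has a non-ASCII label ("{label}") '
                       "which is not a valid U-label.",
    "LABEL_TOO_LONG": "Domain name has a label that is too long "
                      '(more than 63 characters), "{label}".',
    "REPEATED_DOTS": "Domain name has repeated dots.",
}


def render_message(tag: str, **args: str) -> str:
    """Return the message for a tag, prefixed with its level (always CRITICAL here)."""
    return "CRITICAL: " + MESSAGES[tag].format(**args)


print(render_message("INVALID_ASCII", label="a!b"))
```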
Test procedure
Tables 1, 2, 3 and 4 are found in the Detailed requirements section below.
- Create the following sets:
  - Set of permitted ASCII characters in Table 1 below ("Valid ASCII").
  - Set of Unicode white space characters in Table 3 below ("White Space").
  - Set of Unicode full stops (dot characters) in Table 4 below ("Unicode Full Stops").
- If Domain Name starts with one or more of White Space then those are removed from Domain Name before further processing.
- If Domain Name ends with one or more of White Space then those are removed from Domain Name before further processing.
- If Domain Name is an empty string then output EMPTY_DOMAIN_NAME and terminate these test procedures.
- If Domain Name contains LATIN CAPITAL LETTER I WITH DOT ABOVE then:
  - Output AMBIGUOUS_DOWNCASING and the Unicode name of the code point in question.
  - Terminate these test procedures.
- Create an empty, ordered list of labels ("Domain Labels").
- Replace all instances of characters from Unicode Full Stops in Domain Name with the label-separating regular dot U+002E (see Table 2).
- If Domain Name is the root zone, i.e. the exact string "." (U+002E), then terminate these test procedures with no message tags.
- If Domain Name starts with dot (".", U+002E) then output INITIAL_DOT and terminate these test procedures.
- If Domain Name has any instance of two or more consecutive dots (".", U+002E) then output REPEATED_DOTS and terminate these test procedures.
- Remove trailing dot (".", U+002E) from Domain Name.
- Split Domain Name into labels by dot "." (U+002E) and put them in the same order in Domain Labels.
- For each "Label" in Domain Labels do:
  - If all characters in Label are ASCII characters, then do:
    - If any character in Label is not listed in Valid ASCII, then output INVALID_ASCII and Label, and terminate these test procedures.
    - Else, downcase all upper case characters as specified in section "Upper case" below.
  - Else do:
    - Assume that Label is a U-label.
    - Downcase all upper case characters as specified in section "Upper case" below.
    - Normalize Label to NFC as specified in Unicode TR 15. Also see section "Unicode normalization" below.
    - Convert Label to an A-label as specified by IDNA2008.
    - If the conversion failed, then output INVALID_U_LABEL and Label, and terminate these test procedures.
    - Else, replace the U-label in Domain Labels with the A-label from the conversion above.
  - Go to next label.
- For each "Label" in Domain Labels do:
  - If the length (number of characters) of Label is greater than 63 then output LABEL_TOO_LONG and Label, and terminate these test procedures.
- Map the labels in Domain Labels back into Domain Name with one dot (".", U+002E) between the labels (no dots if there is only one label).
- If the length of Domain Name is longer than 253 characters including the dots, then output DOMAIN_NAME_TOO_LONG and terminate these test procedures.
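The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Zonemaster implementation: the function name `normalize_name`, the return shape and the `to_alabel` parameter are invented for this example. The Python standard library has no IDNA2008 codec, so the U-label to A-label conversion is left to a caller-supplied function, and the default converter rejects every non-ASCII label.

```python
import unicodedata
from typing import Callable, Optional, Tuple

VALID_ASCII = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-/_")
WHITE_SPACE = ("\u0020\u0009\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005"
               "\u2006\u2007\u2008\u2009\u200a\u205f\u3000")            # Table 3
FULL_STOPS = {0xFF0E: ".", 0x3002: ".", 0xFF61: "."}                    # Table 4

Failure = Tuple[str, str]  # (message tag, argument or empty string)


def _no_idna(label: str) -> str:
    # Placeholder (assumption): a real setup would plug in an IDNA2008 converter.
    raise ValueError("no IDNA2008 converter configured")


def normalize_name(
    name: str, to_alabel: Callable[[str], str] = _no_idna
) -> Tuple[Optional[str], Optional[Failure]]:
    """Return (normalized name, None) on pass or (None, (tag, argument)) on fail."""
    name = name.strip(WHITE_SPACE)
    if name == "":
        return None, ("EMPTY_DOMAIN_NAME", "")
    if "\u0130" in name:
        return None, ("AMBIGUOUS_DOWNCASING", unicodedata.name("\u0130"))
    name = name.translate(FULL_STOPS)
    if name == ".":
        return ".", None                      # root zone, no further checks
    if name.startswith("."):
        return None, ("INITIAL_DOT", "")
    if ".." in name:
        return None, ("REPEATED_DOTS", "")
    if name.endswith("."):
        name = name[:-1]                      # remove the single final dot
    labels = name.split(".")
    for i, label in enumerate(labels):
        if label.isascii():
            if any(ch not in VALID_ASCII for ch in label):
                return None, ("INVALID_ASCII", label)
            labels[i] = label.lower()
        else:
            # Downcase first, then NFC, then convert to an A-label.
            candidate = unicodedata.normalize("NFC", label.lower())
            try:
                labels[i] = to_alabel(candidate)
            except Exception:
                return None, ("INVALID_U_LABEL", label)
    for label in labels:
        if len(label) > 63:
            return None, ("LABEL_TOO_LONG", label)
    name = ".".join(labels)
    if len(name) > 253:
        return None, ("DOMAIN_NAME_TOO_LONG", "")
    return name, None


print(normalize_name("  Example.COM. "))   # ('example.com', None)
print(normalize_name("a..b"))              # (None, ('REPEATED_DOTS', ''))
```

The sketch returns at the first failure, which mirrors the "terminate these test procedures" wording of each step.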
Outcome(s)
The outcome of the tests in this specification consists of three parts:
- The outcome value as defined below in this section.
- The message tags, if any, and data connected to the message tags, if any.
- Domain Name in the normalized form to be used as input value for all test cases. If the outcome value is "fail" then no Domain Name is returned.
The outcome value of this specification is "fail" if at least one message tag is output. Otherwise it is "pass".
Special procedural requirements
The tests and normalizations defined in this specification must always be run and evaluated before any Zonemaster test case is run.
If the outcome from this specification is "fail", then no test cases should be run.
Detailed requirements
This section describes the requirements on the domain name. Besides ensuring that the domain name is valid, these requirements also ensure that the domain name is used in a normalized form.
ASCII domain name
An ASCII domain name is valid if it follows the rules defined in RFC 1123, section 2.1, i.e. only consists of the ASCII characters "a-z", "A-Z", "0-9", "." and "-" with the extension of the following two characters:
- The LOW LINE (underscore, "_") character standardized for e.g. SRV records (RFC 2782) and other record types and special names.
- The SOLIDUS (forward slash, "/") used in reverse zone names for IPv4 networks smaller than /24. See examples in RFC 2317, section 4.
In ASCII names, upper case A-Z are treated as equal to a-z (RFC 1034, section 3.1 and RFC 1035, section 2.3.3). The regular dot, or FULL STOP ("."), is used as label separator (RFC 1034, section 3.1). Also see Table 2 below.
Table 1: A summary of the valid ASCII characters in labels using Unicode codes.
Unicode code or code range | Character or character range | Comment |
---|---|---|
0061..007A | a-z | |
0041..005A | A-Z | Upper case of a-z |
0030..0039 | 0-9 | |
U+002D | - | HYPHEN-MINUS |
U+002F | / | SOLIDUS (forward slash) |
U+005F | _ | LOW LINE (underscore) |
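The character set in Table 1 can be expressed as a single regular expression. The sketch below is one possible way to check a label against it; the names `VALID_ASCII_LABEL` and `is_valid_ascii_label` are assumptions made for this example. The dot is deliberately absent from the character class because it separates labels and is not part of a label.

```python
import re

# Character class built from Table 1: a-z, A-Z, 0-9, "-", "/" and "_".
VALID_ASCII_LABEL = re.compile(r"[A-Za-z0-9/_-]+")


def is_valid_ascii_label(label: str) -> bool:
    """True if the label is non-empty and every character is listed in Table 1."""
    return VALID_ASCII_LABEL.fullmatch(label) is not None


print(is_valid_ascii_label("_sip"))      # True: LOW LINE is permitted
print(is_valid_ascii_label("0/26"))      # True: SOLIDUS is permitted
print(is_valid_ascii_label("exa mple"))  # False: SPACE is not in Table 1
```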
Table 2: A summary of the valid ASCII character between labels using Unicode codes.
Unicode code | Character | Comment |
---|---|---|
U+002E | . | FULL STOP (in this document referred to as "dot") |
The fact that the "." (U+002E) character is the delimiter between labels puts some limitations on its use. The first label cannot be an empty label unless that is the only label, i.e. the root domain name. With that exception (covered below) a domain name cannot have a "." (dot) initially. Only the last label can be an empty label (the root label), which means that there cannot be two or more consecutive "." (dots) in a valid domain name. The domain name, as entered to Zonemaster, can either have a final dot or not, and will be normalized as described below.
IDN name
A valid IDN name is a domain name in which one or more labels are valid IDN labels (RFC 5890) and the remaining labels are valid ASCII labels as defined above. An IDN label can be an A-label or a U-label (RFC 5890, section 2.3.2.1).
- A valid IDN name where all IDN labels are A-labels will automatically meet the ASCII name requirements above given that the non-IDN labels meet them.
- A valid IDN name with one or more U-labels can be converted to a valid IDN name where all IDN labels are A-labels.
A valid ASCII name is, by definition, encoded in ASCII. A valid IDN name must either be encoded in ASCII (no U-labels) or in UTF-8 (at least one U-label). If not, Zonemaster will not be able to process the domain name. Note that ASCII is a subset of UTF-8.
A valid ASCII name consists, by definition, of only ASCII characters. A valid IDN name must either consist of only ASCII characters (no U-labels, only A-labels) or contain at least one non-ASCII Unicode character in at least one label, i.e. at least one U-label. U-labels and A-labels can be mixed, and IDN labels can be mixed with non-IDN labels.
Length limitations
There is a maximum length for the whole domain name and a maximum length for each label. These limitations are defined for a domain name of ASCII characters only, which means that any IDN U-label must be converted to the equivalent A-label before the limitations can be checked.
The maximum total length of a domain name is 253 characters (or octets) if it has no final dot, 254 with the final dot (RFC 1035, section 2.3.4). Note that the RFC defines the limit as 255 octets, but that is the limitation in the DNS packet, where label separation is done differently.
The maximum length of a label is 63 characters (or octets), RFC 1035, section 2.3.4. A label must be at least one character (octet) long unless it is the label representing the root domain name, which is zero in length and always after the final dot.
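The relation between the 253-character limit and the 255-octet limit can be checked with a little arithmetic. In the DNS packet each label is preceded by one length octet instead of being followed by a dot, and the root label adds a single zero octet, so a name of N characters without final dot occupies N + 2 octets. The sketch below illustrates this; the function names are assumptions made for this example.

```python
def wire_length(name: str) -> int:
    """Octets used in a DNS packet by an ASCII domain name given without final dot."""
    labels = name.split(".")
    # One length octet before each label, plus one zero octet for the root label.
    return sum(1 + len(label) for label in labels) + 1


def within_limits(name: str) -> bool:
    """Check the 63-character label limit and the 253-character name limit."""
    labels = name.split(".")
    return all(1 <= len(label) <= 63 for label in labels) and len(name) <= 253


# Three labels of 63 characters and one of 61, joined by three dots.
longest = ".".join(["a" * 63] * 3 + ["a" * 61])
print(len(longest))            # 253
print(wire_length(longest))    # 255
print(within_limits(longest))  # True
```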
Root zone
If the root zone is to be tested, then it must be represented as a single dot "." and in no other way. The label that represents the root zone is an empty label after the dot.
Creating IDNA2008 compatible format
For a discussion on pre-processing the domain name to achieve IDNA-compatible U-labels from convenient alternative forms, see RFC 5895. Unicode normalization is covered by RFC 5891 and Unicode TR 15.
Unicode normalization
For Unicode strings, normalization processes have been defined to convert different representations into a normalized form. Specifically, it is required that an IDN label (IDNA2008) is in the so-called "Normalization Form C" (NFC) as of RFC 5891, section 5.2.
For ASCII domain names NFC is no issue since they are always in NFC format. For an IDN name the situation is different. The letter "ö" in the IDN domain name "malmö.se" can be represented as either the single Unicode code point U+00F6 or as the Unicode code point sequence "006F 0308". Only the former is in NFC form, which means that if the domain name is entered with the sequence it must be preprocessed before entering IDNA2008 processing, i.e. conversion to A-label format. See Unicode TR 15 for a specification of Unicode normalization and more examples relevant to domain names.
Zonemaster (this specification) requires that any domain name must be converted to NFC form before conversion to A-label. However, if the domain name is entered in A-label format, this specification does not require that the corresponding U-label is in NFC format.
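The "malmö" example can be reproduced with the `unicodedata` module of the Python standard library. The helper name `to_nfc` is an assumption made for this sketch.

```python
import unicodedata


def to_nfc(label: str) -> str:
    """Normalize a label to NFC before it is converted to an A-label."""
    return unicodedata.normalize("NFC", label)


composed = "malm\u00f6"      # "ö" as the single code point U+00F6
decomposed = "malmo\u0308"   # "o" (U+006F) followed by U+0308

print(composed == decomposed)          # False: different code point sequences
print(to_nfc(decomposed) == composed)  # True: NFC composes the sequence
print(to_nfc("example") == "example")  # True: ASCII is unchanged by NFC
```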
White space
In the user interface there is a risk that leading or trailing white space characters are added to the domain name by mistake. The domain name will in this specification be normalized by removing such characters. In Table 3 it is specified what counts as white space characters. It should be pointed out that white space characters within the domain name are not removed, and in the end count as invalid characters.
Table 3: White space characters
Unicode code | Name |
---|---|
U+0020 | SPACE |
U+0009 | CHARACTER TABULATION |
U+00A0 | NO-BREAK SPACE |
U+2000 | EN QUAD |
U+2001 | EM QUAD |
U+2002 | EN SPACE |
U+2003 | EM SPACE |
U+2004 | THREE-PER-EM SPACE |
U+2005 | FOUR-PER-EM SPACE |
U+2006 | SIX-PER-EM SPACE |
U+2007 | FIGURE SPACE |
U+2008 | PUNCTUATION SPACE |
U+2009 | THIN SPACE |
U+200A | HAIR SPACE |
U+205F | MEDIUM MATHEMATICAL SPACE |
U+3000 | IDEOGRAPHIC SPACE |
U+1680 | OGHAM SPACE MARK |
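Trimming with exactly the characters of Table 3 could look like the sketch below. The names `WHITE_SPACE` and `trim` are assumptions made for this example. The explicit character set is passed to `str.strip()` because the no-argument form also removes characters that are not in Table 3, such as line feed.

```python
# The 17 code points of Table 3 as one string usable with str.strip().
WHITE_SPACE = (
    "\u0020\u0009\u00a0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u205f\u3000"
)


def trim(name: str) -> str:
    """Remove leading and trailing Table 3 characters; interior ones are kept."""
    return name.strip(WHITE_SPACE)


print(repr(trim("\u00a0 example.com\t")))  # 'example.com'
print(repr(trim("exa mple.com ")))         # 'exa mple.com' (interior space kept)
```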
Full stop
The regular dot "." expected in domain names is a U+002E (FULL STOP), see Table 2 above. There are other characters that may be entered instead due to the script setting. Table 4 lists full stop characters that are to be mapped into the ASCII FULL STOP (Unicode TR 46, section 2.3). That mapping must be done before any verification or checks of the dot and before splitting Domain Name into labels.
Table 4: Non-ASCII dots (Full Stops) using Unicode codes
Unicode code | Character | Name |
---|---|---|
U+FF0E | ． | FULLWIDTH FULL STOP |
U+3002 | 。 | IDEOGRAPHIC FULL STOP |
U+FF61 | ｡ | HALFWIDTH IDEOGRAPHIC FULL STOP |
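The mapping of the Table 4 characters to U+002E can be done with a translation table, as in this sketch. The names `FULL_STOP_MAP` and `map_full_stops` are assumptions made for this example.

```python
# Translation table from the three Table 4 code points to U+002E FULL STOP.
FULL_STOP_MAP = str.maketrans({"\uff0e": ".", "\u3002": ".", "\uff61": "."})


def map_full_stops(name: str) -> str:
    """Replace every non-ASCII full stop from Table 4 with the regular dot."""
    return name.translate(FULL_STOP_MAP)


print(map_full_stops("example\u3002com\uff0e"))  # example.com.
```

Running this mapping before any check of the dots means that a name entered with, for example, two consecutive ideographic full stops is reported in the same way as one entered with two regular dots.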
Final dot
If the domain name has one final dot it should be removed to create a consistent representation. The exception is the root zone which is always represented by the exact string ".".
Upper case
If the domain name has any letters tagged as "upper case" by the Unicode database, those should be mapped into the equivalent lower case letter. This applies to both ASCII (i.e. "A-Z" mapped into "a-z") in both A- and U-labels and non-ASCII characters found in U-labels (RFC 5895, section 2). This mapping is done before a U-label is converted to A-label. A valid U-label must not contain any upper case letters.
For Zonemaster, special rules apply to U+0049 (LATIN CAPITAL LETTER I) and U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
- LATIN CAPITAL LETTER I is downcased to U+0069 (LATIN SMALL LETTER I) also in Turkish and Azeri locales, i.e. not following the special Unicode rule in those locales (Unicode SpecialCasing).
- Label with LATIN CAPITAL LETTER I WITH DOT ABOVE should be rejected since normal downcasing gives a sequence not reasonable in a domain name context (see "Lowercase Mapping" in LATIN CAPITAL LETTER I WITH DOT ABOVE).
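The two special rules can be illustrated with Python's `str.lower()`, which does not depend on the locale and therefore always maps U+0049 to U+0069. For U+0130 it returns the two code point sequence U+0069 U+0307, which is the result this specification rejects. The function name `downcase_label` and the use of `ValueError` are assumptions made for this sketch.

```python
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE


def downcase_label(label: str) -> str:
    """Downcase a label, rejecting U+0130 instead of downcasing it."""
    if DOTTED_CAPITAL_I in label:
        raise ValueError("AMBIGUOUS_DOWNCASING")
    return label.lower()


print(downcase_label("IANA"))                # iana
print(downcase_label("MALM\u00d6"))          # malmö
print(len(DOTTED_CAPITAL_I.lower()))         # 2: "i" followed by U+0307
```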
A-label and U-label
DNS can only handle A-labels, not U-labels. In the test core suite of Zonemaster only A-labels are used. For normalization, all U-labels are converted to A-labels. Test cases will only handle an ASCII-only Domain Name. Conversion from U-label to A-label should be done as specified for IDNA2008, not IDNA2003.
Terminology
No special terminology for this specification.