Until now, there has been no standard method for domain names to use characters outside the ASCII repertoire. The IDNA document defines internationalized domain names (IDNs) and a mechanism called IDNA for handling them in a standard fashion. IDNs use characters drawn from a large repertoire (Unicode), but IDNA allows the non-ASCII characters to be represented using only the ASCII characters already allowed in so-called host names today. This backward-compatible representation is required in existing protocols like DNS, so that IDNs can be introduced with no changes to the existing infrastructure. IDNA is only meant for processing domain names, not free text.
idna.h
To use the functions explained in this chapter, you need to include the file idna.h using:
#include <idna.h>
The IDNA flags parameter can take on the following values, or a bit-wise inclusive or of any subset of them:
IDNA_ALLOW_UNASSIGNED: Allow unassigned Unicode code points.
IDNA_USE_STD3_ASCII_RULES: Check output to make sure it is a STD3 conforming host name.
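As a rough illustration of what "STD3 conforming host name" means for a single ASCII label, the following sketch checks the two conditions that the return codes IDNA_CONTAINS_NON_LDH and IDNA_CONTAINS_MINUS in this chapter refer to. The function std3_check_label and the enum std3_result are hypothetical names made up for this example; this is not the library's implementation.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch only; they are not the
   library's Idna_rc values. */
enum std3_result { STD3_OK = 0, STD3_NON_LDH, STD3_HYPHEN_EDGE };

/* Check one ASCII label against the host name rules: only letters,
   digits and hyphen-minus (LDH), and no hyphen-minus first or last. */
enum std3_result std3_check_label(const char *label, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++) {
    char c = label[i];
    int ldh = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
              || (c >= '0' && c <= '9') || c == '-';
    if (!ldh)
      return STD3_NON_LDH;
  }
  if (len > 0 && (label[0] == '-' || label[len - 1] == '-'))
    return STD3_HYPHEN_EDGE;
  return STD3_OK;
}
```

With this sketch, a label such as "ex_ample" would be reported as containing a non-LDH character, and "-example" as having a hyphen-minus at an edge.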
The idea behind the IDNA function names is as follows: the idna_to_ascii_4i and idna_to_unicode_44i functions are the core IDNA primitives. The 4 indicates that the function takes UCS-4 strings (i.e., Unicode code points encoded in a 32-bit unsigned integer type) of the specified length. The i indicates that the data is written “inline” into the buffer. This means the caller is responsible for allocating (and deallocating) the string, and for providing the library with the allocated length of the string. The output length is written in the output length variable. The remaining functions all contain the z indicator, which means the strings are zero terminated. For these functions, all output strings are allocated by the library, and must be deallocated by the caller. The 4 indicator again means that the string is UCS-4, the 8 means the strings are UTF-8, and the l indicator means the strings are encoded in the encoding used by the current locale.
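The two buffer conventions described above can be sketched with stand-in functions. The names demo_copy_44i and demo_dup_8z8z are hypothetical and merely copy their input; they are not part of the library and only mimic the ownership rules of the i and z styles.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* "i" style stand-in: the caller owns `out` and passes its capacity in
   *outlen; on success *outlen holds the number of elements written.
   Returns 0 on success, nonzero if the caller's buffer is too small. */
int demo_copy_44i(const uint32_t *in, size_t inlen,
                  uint32_t *out, size_t *outlen)
{
  if (*outlen < inlen)
    return 1;
  memcpy(out, in, inlen * sizeof *in);
  *outlen = inlen;
  return 0;
}

/* "z" style stand-in: the input is zero terminated and the function
   allocates *output, which the caller must release with free().
   Returns 0 on success, nonzero if allocation fails. */
int demo_dup_8z8z(const char *input, char **output)
{
  size_t n = strlen(input) + 1;

  *output = malloc(n);
  if (*output == NULL)
    return 1;
  memcpy(*output, input, n);
  return 0;
}
```

In the first style the caller declares the array and a length variable up front; in the second the caller passes the address of a pointer and calls free() on it afterwards.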
The functions provided are the following entry points:
Function: idna_to_ascii_4i
in: input array with Unicode code points.
inlen: length of input array with Unicode code points.
out: output zero terminated string that must have room for at least 63 characters plus the terminating zero.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
The ToASCII operation takes a sequence of Unicode code points that make up one label and transforms it into a sequence of code points in the ASCII range (0..7F). If ToASCII succeeds, the original sequence and the resulting sequence are equivalent labels.
It is important to note that the ToASCII operation can fail. ToASCII fails if any step of it fails. If any step of the ToASCII operation fails on any label in a domain name, that domain name MUST NOT be used as an internationalized domain name. The method for dealing with this failure is application-specific.
The inputs to ToASCII are a sequence of code points, the AllowUnassigned flag, and the UseSTD3ASCIIRules flag. The output of ToASCII is either a sequence of ASCII code points or a failure condition.
ToASCII never alters a sequence of code points that are all in the ASCII range to begin with (although it could fail). Applying the ToASCII operation multiple times has exactly the same effect as applying it just once.
Return value: Returns 0 on success, or an Idna_rc error code.
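The statement that ToASCII never alters an all-ASCII sequence, together with the 63 characters plus terminating zero buffer requirement, can be illustrated with a small sketch. The function ascii_label_passthrough is a hypothetical helper that handles only the pass-through case; it performs none of the other ToASCII steps and is not how the library is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy an all-ASCII label of `inlen` code points into `out`, which must
   have room for 63 characters plus the terminating zero. Returns 0 on
   success, 1 if a code point is outside 0..7F, and 2 if the length is
   outside the range 1 to 63. */
int ascii_label_passthrough(const uint32_t *in, size_t inlen, char out[64])
{
  size_t i;

  if (inlen < 1 || inlen > 63)
    return 2;
  for (i = 0; i < inlen; i++) {
    if (in[i] > 0x7F)
      return 1;
    out[i] = (char) in[i];
  }
  out[inlen] = '\0';
  return 0;
}
```

Because the output equals the input for such labels, feeding the result back in gives the same string again, which matches the remark that applying ToASCII repeatedly has the same effect as applying it once.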
Function: idna_to_unicode_44i
in: input array with Unicode code points.
inlen: length of input array with Unicode code points.
out: output array with Unicode code points.
outlen: on input, the maximum size of the output array, counted in Unicode code points; on exit, the actual size of the output array, counted in Unicode code points.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
The ToUnicode operation takes a sequence of Unicode code points that make up one label and returns a sequence of Unicode code points. If the input sequence is a label in ACE form, then the result is an equivalent internationalized label that is not in ACE form, otherwise the original sequence is returned unaltered.
ToUnicode never fails. If any step fails, then the original input sequence is returned immediately in that step.
The Punycode decoder can never output more code points than it inputs, but Nameprep can, and therefore ToUnicode can. Note that the number of octets needed to represent a sequence of code points depends on the particular character encoding used.
The inputs to ToUnicode are a sequence of code points, the AllowUnassigned flag, and the UseSTD3ASCIIRules flag. The output of ToUnicode is always a sequence of Unicode code points.
Return value: Returns an Idna_rc error condition, but it must only be used for debugging purposes. The output buffer is always guaranteed to contain the correct data according to the specification (sans malloc induced errors). NB! This means that you should normally ignore the return code from this function, as checking it means breaking the standard.
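ToUnicode returns its input unaltered when the label is not in ACE form. A minimal sketch of that first test follows, assuming the ACE prefix "xn--" defined by RFC 3490 (the prefix string is not stated in this chapter) and comparing it without regard to ASCII case. The function has_ace_prefix is a hypothetical helper, not a library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the label starts with the ACE prefix "xn--" (compared
   without regard to ASCII case), otherwise 0. */
int has_ace_prefix(const uint32_t *label, size_t len)
{
  static const char prefix[] = "xn--";
  size_t i;

  if (len < 4)
    return 0;
  for (i = 0; i < 4; i++) {
    uint32_t c = label[i];
    if (c >= 'A' && c <= 'Z')
      c += 'a' - 'A';              /* fold ASCII upper case */
    if (c != (uint32_t) prefix[i])
      return 0;
  }
  return 1;
}
```

A caller that wants the "never fails" behaviour could keep a copy of the input and fall back to it whenever this check, or any later decoding step, does not succeed.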
Function: idna_to_ascii_4z
input: zero terminated input Unicode string.
output: pointer to newly allocated output string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert UCS-4 domain name to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or an error code.
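Since the domain-name functions process a name label by label, it can help to see how a name breaks into labels. The sketch below is a simplified illustration that splits on the ASCII full stop only; split_labels is a hypothetical helper and not part of the library.

```c
#include <stddef.h>

/* Record the start offset and length of each dot-separated label in
   `domain`. Returns the number of labels found, or 0 if there are more
   than `max`. Only the ASCII full stop is treated as a separator. */
size_t split_labels(const char *domain, size_t *starts, size_t *lens,
                    size_t max)
{
  size_t count = 0, begin = 0, i;

  for (i = 0;; i++) {
    if (domain[i] == '.' || domain[i] == '\0') {
      if (count == max)
        return 0;
      starts[count] = begin;
      lens[count] = i - begin;
      count++;
      begin = i + 1;
      if (domain[i] == '\0')
        break;
    }
  }
  return count;
}
```

For "www.example.com" this yields three labels of lengths 3, 7 and 3, each of which would then be subject to the per-label rules such as the 1 to 63 character limit.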
Function: idna_to_ascii_8z
input: zero terminated input UTF-8 string.
output: pointer to newly allocated output string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert UTF-8 domain name to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or an error code.
Function: idna_to_ascii_lz
input: zero terminated input string encoded in the current locale's character set.
output: pointer to newly allocated output string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert domain name in the locale's encoding to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or an error code.
Function: idna_to_unicode_4z4z
input: zero-terminated Unicode string.
output: pointer to newly allocated output Unicode string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in UCS-4 format into a UCS-4 string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or an error code.
Function: idna_to_unicode_8z4z
input: zero-terminated UTF-8 string.
output: pointer to newly allocated output Unicode string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in UTF-8 format into a UCS-4 string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or an error code.
Function: idna_to_unicode_8z8z
input: zero-terminated UTF-8 string.
output: pointer to newly allocated output UTF-8 string.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in UTF-8 format into a UTF-8 string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or an error code.
Function: idna_to_unicode_8zlz
input: zero-terminated UTF-8 string.
output: pointer to newly allocated output string encoded in the current locale's character set.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in UTF-8 format into a string encoded in the current locale's character set. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or an error code.
Function: idna_to_unicode_lzlz
input: zero-terminated string encoded in the current locale's character set.
output: pointer to newly allocated output string encoded in the current locale's character set.
flags: an Idna_flags value, e.g., IDNA_ALLOW_UNASSIGNED or IDNA_USE_STD3_ASCII_RULES.
Convert possibly ACE encoded domain name in the locale's character set into a string encoded in the current locale's character set. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
Return value: Returns IDNA_SUCCESS on success, or an error code.
Function: idna_strerror
rc: an Idna_rc return code.
Convert a return code integer to a text string. This string can be used to output a diagnostic message to the user.
IDNA_SUCCESS: Successful operation. This value is guaranteed to always be zero; the remaining ones are only guaranteed to hold non-zero values, for logical comparison purposes.
IDNA_STRINGPREP_ERROR: Error during string preparation.
IDNA_PUNYCODE_ERROR: Error during punycode operation.
IDNA_CONTAINS_NON_LDH: For IDNA_USE_STD3_ASCII_RULES, indicates that the string contains non-LDH ASCII characters.
IDNA_CONTAINS_MINUS: For IDNA_USE_STD3_ASCII_RULES, indicates that the string contains a leading or trailing hyphen-minus (U+002D).
IDNA_INVALID_LENGTH: The final output string is not within the (inclusive) range 1 to 63 characters.
IDNA_NO_ACE_PREFIX: The string does not contain the ACE prefix (for ToUnicode).
IDNA_ROUNDTRIP_VERIFY_ERROR: The ToASCII operation on output string does not equal the input.
IDNA_CONTAINS_ACE_PREFIX: The input contains the ACE prefix (for ToASCII).
IDNA_ICONV_ERROR: Could not convert string in locale encoding.
IDNA_MALLOC_ERROR: Could not allocate buffer (this is typically a fatal error).
IDNA_DLOPEN_ERROR: Could not dlopen the libcidn DSO (only used internally in libc).
Return value: Returns a pointer to a statically allocated string containing a description of the error with the return code rc.
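The pattern of a zero success code, non-zero failure codes, and a lookup that returns a statically allocated description can be sketched as follows. The enum demo_rc and the function demo_strerror are hypothetical stand-ins invented for this example; their names, values and messages are not those of the library.

```c
/* Hypothetical return codes for this sketch; success is zero and every
   failure is non-zero, mirroring the convention described above. */
enum demo_rc { DEMO_SUCCESS = 0, DEMO_BAD_INPUT = 1, DEMO_NO_MEMORY = 2 };

/* Return a statically allocated description of `rc`. The caller must
   not modify or free the returned string. */
const char *demo_strerror(enum demo_rc rc)
{
  switch (rc) {
  case DEMO_SUCCESS:
    return "Success";
  case DEMO_BAD_INPUT:
    return "Bad input";
  case DEMO_NO_MEMORY:
    return "Out of memory";
  }
  return "Unknown error";
}
```

Because success is zero, a caller can test a result with a plain logical comparison (for example `if (rc != 0)`) and pass the code to the lookup function only on the failure path.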