3 Utility Functions

The rest of this library makes extensive use of Unicode characters. In order to interface this library with the outside world, your application may need to make various Unicode transformations.

3.1 Header file stringprep.h

To use the functions explained in this chapter, you need to include the file stringprep.h using:

     #include <stringprep.h>

3.2 Unicode Encoding Transformation

stringprep_unichar_to_utf8

— Function: int stringprep_unichar_to_utf8 (uint32_t c, char * outbuf)

c: an ISO 10646 character code

outbuf: output buffer, must have at least 6 bytes of space. If NULL, the length will be computed and returned and nothing will be written to outbuf.

Converts a single character to UTF-8.

Return value: number of bytes written.
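To illustrate the byte layout behind this contract, here is a minimal standalone sketch. It is not the library's source: the name sketch_unichar_to_utf8 is hypothetical, and the code only mirrors the behaviour documented above (at most 6 bytes, length-only mode when outbuf is NULL).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in that mirrors the documented contract of
   stringprep_unichar_to_utf8; it is not the library's code. */
int
sketch_unichar_to_utf8 (uint32_t c, char *outbuf)
{
  unsigned first;               /* marker bits of the lead byte */
  int len;
  int i;

  if (c < 0x80)           { first = 0x00; len = 1; }
  else if (c < 0x800)     { first = 0xc0; len = 2; }
  else if (c < 0x10000)   { first = 0xe0; len = 3; }
  else if (c < 0x200000)  { first = 0xf0; len = 4; }
  else if (c < 0x4000000) { first = 0xf8; len = 5; }
  else                    { first = 0xfc; len = 6; }

  if (outbuf != NULL)
    {
      /* Fill continuation bytes from the end, 6 payload bits each. */
      for (i = len - 1; i > 0; i--)
        {
          outbuf[i] = (char) ((c & 0x3f) | 0x80);
          c >>= 6;
        }
      outbuf[0] = (char) (c | first);
    }
  return len;
}
```

Calling the sketch with a NULL buffer first is one way to learn how much space a character needs. With the library itself, the corresponding call would be stringprep_unichar_to_utf8 (c, buf) with a buffer of at least 6 bytes.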

stringprep_utf8_to_unichar

— Function: uint32_t stringprep_utf8_to_unichar (const char * p)

p: a pointer to a Unicode character encoded as UTF-8

Converts a sequence of bytes encoded as UTF-8 to a Unicode character. If p does not point to a valid UTF-8 encoded character, results are undefined.

Return value: the resulting character.
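The reverse direction can be sketched in the same spirit. The function name sketch_utf8_to_unichar is hypothetical and the code is an illustration only; like the documented function, it assumes that p points at a valid UTF-8 sequence and performs no checking.

```c
#include <stdint.h>

/* Hypothetical stand-in for illustration only; like the documented
   function, it assumes p points at a valid UTF-8 sequence. */
uint32_t
sketch_utf8_to_unichar (const char *p)
{
  const unsigned char *s = (const unsigned char *) p;
  uint32_t c;
  int len;
  int i;

  /* The lead byte determines the sequence length and carries the
     most significant payload bits. */
  if (s[0] < 0x80)      { c = s[0];        len = 1; }
  else if (s[0] < 0xe0) { c = s[0] & 0x1f; len = 2; }
  else if (s[0] < 0xf0) { c = s[0] & 0x0f; len = 3; }
  else if (s[0] < 0xf8) { c = s[0] & 0x07; len = 4; }
  else if (s[0] < 0xfc) { c = s[0] & 0x03; len = 5; }
  else                  { c = s[0] & 0x01; len = 6; }

  /* Each continuation byte contributes 6 more bits. */
  for (i = 1; i < len; i++)
    c = (c << 6) | (s[i] & 0x3f);
  return c;
}
```

Because nothing is validated, a caller that handles untrusted input would need to check the bytes before passing them to a function of this kind.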

stringprep_ucs4_to_utf8

— Function: char * stringprep_ucs4_to_utf8 (const uint32_t * str, ssize_t len, size_t * items_read, size_t * items_written)

str: a UCS-4 encoded string

len: the maximum length of str to use. If len < 0, then the string is terminated with a 0 character.

items_read: location to store the number of characters read, or NULL.

items_written: location to store the number of bytes written, or NULL. The value stored here does not include the trailing 0 byte.

Converts a string from a 32-bit fixed-width representation (UCS-4) to UTF-8. The result will be terminated with a 0 byte.

Return value: a pointer to a newly allocated UTF-8 string. This value must be freed with free(). If an error occurs, NULL will be returned.
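The roles of len, items_read and items_written are easiest to see in code. The following standalone sketch is not the library's implementation; sketch_ucs4_to_utf8 and encode_one are hypothetical names, and stopping at an embedded 0 even when len is non-negative is an assumption of the sketch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>          /* ssize_t */

/* Compact UTF-8 encoder: returns the byte count and writes only
   when out is non-NULL. */
static int
encode_one (uint32_t c, char *out)
{
  static const unsigned char lead[] =
    { 0, 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
  int len = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3
    : c < 0x200000 ? 4 : c < 0x4000000 ? 5 : 6;
  int i;

  if (out != NULL)
    {
      for (i = len - 1; i > 0; i--, c >>= 6)
        out[i] = (char) ((c & 0x3f) | 0x80);
      out[0] = (char) (c | lead[len]);
    }
  return len;
}

/* Hypothetical stand-in illustrating the documented parameters of
   stringprep_ucs4_to_utf8; not the library's implementation. */
char *
sketch_ucs4_to_utf8 (const uint32_t *str, ssize_t len,
                     size_t *items_read, size_t *items_written)
{
  size_t n = 0, bytes = 0, i;
  char *result, *p;

  /* First pass: measure.  len < 0 means "stop at the 0 character";
     this sketch also stops at a 0 when len >= 0 (an assumption). */
  while ((len < 0 || n < (size_t) len) && str[n] != 0)
    bytes += (size_t) encode_one (str[n++], NULL);

  result = malloc (bytes + 1);
  if (result == NULL)
    return NULL;

  /* Second pass: encode, then add the terminating 0 byte. */
  for (p = result, i = 0; i < n; i++)
    p += encode_one (str[i], p);
  *p = '\0';

  if (items_read != NULL)
    *items_read = n;
  if (items_written != NULL)
    *items_written = bytes;     /* excludes the trailing 0 byte */
  return result;
}
```

Measuring first and allocating once keeps the sketch short; the caller owns the returned buffer and releases it with free().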

stringprep_utf8_to_ucs4

— Function: uint32_t * stringprep_utf8_to_ucs4 (const char * str, ssize_t len, size_t * items_written)

str: a UTF-8 encoded string

len: the maximum length of str to use. If len < 0, then the string is nul-terminated.

items_written: location to store the number of characters in the result, or NULL.

Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4, assuming valid UTF-8 input. This function does no error checking on the input.

Return value: a pointer to a newly allocated UCS-4 string. This value must be freed with free().
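A matching sketch for the UTF-8 to UCS-4 direction follows. Again the names (sketch_utf8_to_ucs4, seq_len) are hypothetical and the code is only an illustration: it trusts its input completely, as the description above says the real function does, and it assumes that a non-negative len ends on a character boundary.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>          /* ssize_t */

/* Length in bytes of the sequence introduced by lead byte b,
   assuming b is a valid lead byte. */
static int
seq_len (unsigned char b)
{
  return b < 0x80 ? 1 : b < 0xe0 ? 2 : b < 0xf0 ? 3
    : b < 0xf8 ? 4 : b < 0xfc ? 5 : 6;
}

/* Hypothetical stand-in; performs no validation of the input. */
uint32_t *
sketch_utf8_to_ucs4 (const char *str, ssize_t len, size_t *items_written)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t pos, n = 0, i;
  uint32_t *result;

  /* First pass: count characters up to len bytes or the 0 byte. */
  for (pos = 0; (len < 0 || pos < (size_t) len) && s[pos] != 0;
       pos += (size_t) seq_len (s[pos]))
    n++;

  result = malloc ((n + 1) * sizeof *result);
  if (result == NULL)
    return NULL;

  /* Second pass: decode each sequence into one 32-bit value. */
  for (pos = 0, i = 0; i < n; i++)
    {
      int l = seq_len (s[pos]);
      uint32_t c = (l == 1) ? s[pos] : (uint32_t) (s[pos] & (0x7f >> l));
      int k;

      for (k = 1; k < l; k++)
        c = (c << 6) | (s[pos + k] & 0x3f);
      result[i] = c;
      pos += (size_t) l;
    }
  result[n] = 0;                /* terminating 0 character */

  if (items_written != NULL)
    *items_written = n;
  return result;
}
```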

3.3 Unicode Normalization

stringprep_ucs4_nfkc_normalize

— Function: uint32_t * stringprep_ucs4_nfkc_normalize (uint32_t * str, ssize_t len)

str: a Unicode string.

len: length of str array, or -1 if str is nul-terminated.

Converts a UCS-4 string into UTF-8 and runs stringprep_utf8_nfkc_normalize().

Return value: a newly allocated Unicode string, which is the NFKC normalized form of str.

stringprep_utf8_nfkc_normalize

— Function: char * stringprep_utf8_nfkc_normalize (const char * str, ssize_t len)

str: a UTF-8 encoded string.

len: length of str, in bytes, or -1 if str is nul-terminated.

Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character.

The normalization mode is NFKC (ALL COMPOSE). It standardizes differences that do not affect the text content, such as the above-mentioned accent representation. It standardizes the "compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the standard forms (in this case DIGIT THREE). Formatting information may be lost but for most text operations such characters should be considered the same. It returns a result with composed forms rather than a maximally decomposed form.

Return value: a newly allocated string, which is the NFKC normalized form of str.

3.4 Character Set Conversion

stringprep_locale_charset

— Function: const char * stringprep_locale_charset (void)

Finds out the current locale charset. The function respects the CHARSET environment variable, but typically uses nl_langinfo(CODESET) when it is supported. It falls back on "ASCII" if CHARSET isn't set and nl_langinfo isn't supported or doesn't return anything.

Note that this function returns the application's locale's preferred charset (or the thread's locale's preferred charset, if your system supports thread-specific locales). It does not return what the system may be using. Thus, if you receive data from external sources, you cannot in general use this function to guess what charset it is encoded in. Use stringprep_convert to convert from the external representation into the charset returned by this function, to have data in the locale encoding.

Return value: the character set used by the current locale. It will never return NULL, but uses "ASCII" as a fallback.
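The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: sketch_locale_charset is a hypothetical name, the sketch assumes a POSIX system that provides <langinfo.h>, and treating an empty CHARSET value as unset is a choice made here.

```c
#include <langinfo.h>
#include <stdlib.h>

/* Hypothetical stand-in following the documented lookup order:
   CHARSET environment variable, then nl_langinfo(CODESET), then
   "ASCII".  Treating an empty value as "not set" is an assumption. */
const char *
sketch_locale_charset (void)
{
  const char *charset = getenv ("CHARSET");

  if (charset == NULL || *charset == '\0')
    charset = nl_langinfo (CODESET);
  if (charset == NULL || *charset == '\0')
    charset = "ASCII";
  return charset;
}
```

A C program starts in the "C" locale, so an application normally calls setlocale (LC_ALL, "") before a query of this kind; otherwise nl_langinfo reports the codeset of the "C" locale rather than that of the user's environment.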

stringprep_convert

— Function: char * stringprep_convert (const char * str, const char * to_codeset, const char * from_codeset)

str: input zero-terminated string.

to_codeset: name of destination character set.

from_codeset: name of origin character set, as used by str.

Convert the string from one character set to another using the system's iconv() function.

Return value: Returns a newly allocated zero-terminated string which is str transcoded into to_codeset.
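For readers unfamiliar with iconv(), the sketch below shows one way such a conversion can be written. It is not the library's implementation: sketch_convert is a hypothetical name, the buffer growth policy is arbitrary, and the sketch assumes a byte-oriented, stateless target encoding so that a single 0 byte terminates the result.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in showing one way to transcode a
   zero-terminated string with iconv(); not the library's code.
   Assumes the target encoding is byte-oriented and stateless. */
char *
sketch_convert (const char *str, const char *to_codeset,
                const char *from_codeset)
{
  iconv_t cd = iconv_open (to_codeset, from_codeset);
  char *inp = (char *) str;     /* iconv() does not modify the input */
  size_t inleft = strlen (str);
  size_t cap = inleft + 16;
  size_t outleft, used;
  char *out, *outp, *tmp;

  if (cd == (iconv_t) -1)
    return NULL;                /* unknown or unsupported codeset */

  out = malloc (cap);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }
  outp = out;
  outleft = cap - 1;            /* keep room for the terminator */

  while (inleft > 0)
    if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
      {
        /* Anything other than "output buffer full" is fatal. */
        tmp = (errno == E2BIG) ? realloc (out, cap * 2) : NULL;
        if (tmp == NULL)
          {
            free (out);
            iconv_close (cd);
            return NULL;
          }
        used = (size_t) (outp - out);
        out = tmp;
        cap *= 2;
        outp = out + used;
        outleft = cap - 1 - used;
      }

  *outp = '\0';
  iconv_close (cd);
  return out;
}
```

As with the documented function, the caller releases the result with free(). Which codeset names are accepted depends on the system's iconv implementation.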

stringprep_locale_to_utf8

— Function: char * stringprep_locale_to_utf8 (const char * str)

str: input zero-terminated string.

Converts a string encoded in the locale's character set into UTF-8 by using stringprep_convert().

Return value: Returns a newly allocated zero-terminated string which is str transcoded into UTF-8.

stringprep_utf8_to_locale

— Function: char * stringprep_utf8_to_locale (const char * str)

str: input zero-terminated string.

Converts a string encoded in UTF-8 into the locale's character set by using stringprep_convert().

Return value: Returns a newly allocated zero-terminated string which is str transcoded into the locale's character set.