src/linebreak/linebreak.c File Reference
Enumerations |
enum | BreakAction { DIR_BRK, IND_BRK, CMI_BRK, CMP_BRK, PRH_BRK } |
Functions |
void | init_linebreak (void) |
utf32_t | lb_get_next_char_utf8 (const utf8_t *s, size_t len, size_t *ip) |
utf32_t | lb_get_next_char_utf16 (const utf16_t *s, size_t len, size_t *ip) |
utf32_t | lb_get_next_char_utf32 (const utf32_t *s, size_t len, size_t *ip) |
void | set_linebreaks (const void *s, size_t len, const char *lang, char *brks, get_next_char_t get_next_char) |
void | set_linebreaks_utf8 (const utf8_t *s, size_t len, const char *lang, char *brks) |
void | set_linebreaks_utf16 (const utf16_t *s, size_t len, const char *lang, char *brks) |
void | set_linebreaks_utf32 (const utf32_t *s, size_t len, const char *lang, char *brks) |
int | is_line_breakable (utf32_t char1, utf32_t char2, const char *lang) |
Variables |
const int | linebreak_version = LINEBREAK_VERSION |
Detailed Description
Implementation of the line breaking algorithm as described in Unicode Standard Annex 14.
- Version:
- 2.1, 2011/05/07
- Author:
- Wu Yongwei
Enumeration Type Documentation
enum BreakAction
Enumeration of break actions. They are used in the break action pair table defined in this file.
- Enumerator:
DIR_BRK | Direct break opportunity |
IND_BRK | Indirect break opportunity |
CMI_BRK | Indirect break opportunity for combining marks |
CMP_BRK | Prohibited break for combining marks |
PRH_BRK | Prohibited break |
Function Documentation
void init_linebreak (void)
Initializes the second-level index to the line breaking properties. If it is not called, get_char_lb_class_lang (and thus the main functionality) can be very slow, especially for large code points such as those of Chinese characters.
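The benefit of a second-level index can be pictured with a small two-level lookup. The following is a minimal sketch of the general idea only, not the library's code: the struct names, the table contents, the class numbers, and the bucket count are all hypothetical. A sorted table of code point ranges is split into a few slices once, so that a lookup for a large code point can skip whole slices instead of scanning every range before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical property range: code points [start, end] share one class. */
struct prop_range { uint32_t start, end; int cls; };

/* Hypothetical sorted, non-overlapping table (values are illustrative only). */
static const struct prop_range ranges[] = {
    {0x0020, 0x0020, 1}, {0x0030, 0x0039, 2}, {0x0041, 0x005A, 3},
    {0x0061, 0x007A, 3}, {0x3040, 0x309F, 4}, {0x4E00, 0x9FFF, 4},
};
#define N_RANGES (sizeof ranges / sizeof ranges[0])
#define N_BUCKETS 3
#define CLS_UNKNOWN 0

/* One index entry covers a contiguous slice of the range table. */
struct index_entry {
    uint32_t start, end;
    const struct prop_range *first;
    size_t count;
};
static struct index_entry idx[N_BUCKETS];

/* Build the index once: split the table into N_BUCKETS slices. */
static void build_index(void)
{
    assert(N_RANGES >= N_BUCKETS);
    size_t per = N_RANGES / N_BUCKETS;
    for (size_t b = 0; b < N_BUCKETS; ++b) {
        size_t lo = b * per;
        size_t hi = (b == N_BUCKETS - 1) ? N_RANGES : lo + per;
        idx[b].first = &ranges[lo];
        idx[b].count = hi - lo;
        idx[b].start = ranges[lo].start;
        idx[b].end = ranges[hi - 1].end;
    }
}

/* Look up a class: pick the slice first, then scan only that slice. */
static int lookup_indexed(uint32_t ch)
{
    for (size_t b = 0; b < N_BUCKETS; ++b) {
        if (ch > idx[b].end)
            continue;               /* skip the whole slice */
        if (ch < idx[b].start)
            break;                  /* falls in a gap between slices */
        for (size_t i = 0; i < idx[b].count; ++i) {
            const struct prop_range *r = &idx[b].first[i];
            if (ch >= r->start && ch <= r->end)
                return r->cls;
        }
        break;                      /* inside the slice but in no range */
    }
    return CLS_UNKNOWN;
}
```

In this sketch, build_index plays the role that init_linebreak is described as playing: it must run once before the first lookup, and lookups for code points near the end of the table then touch only one slice.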
utf32_t lb_get_next_char_utf8 (const utf8_t *s, size_t len, size_t *ip)
Gets the next Unicode character in a UTF-8 sequence. The index will be advanced to the next complete character, unless the end of string is reached in the middle of a UTF-8 sequence.
- Parameters:
-
[in] | s | input UTF-8 string |
[in] | len | length of the string in bytes |
[in,out] | ip | pointer to the index |
- Returns:
- the Unicode character beginning at the index; or EOS if end of input is encountered
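The index-advancing contract described above can be illustrated with a small stand-alone decoder. This is a sketch under stated assumptions, not the library's implementation: sketch_next_utf8 and SKETCH_EOS are hypothetical names, continuation bytes are not validated, and a stray byte is simply passed through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EOS 0xFFFFFFFFu  /* hypothetical end-of-input marker */

/* Decode one character starting at s[*ip]; advance *ip past it.
 * If the input ends in the middle of a sequence, return SKETCH_EOS
 * and leave *ip unchanged. */
static uint32_t sketch_next_utf8(const uint8_t *s, size_t len, size_t *ip)
{
    assert(s != NULL && ip != NULL);
    if (*ip >= len)
        return SKETCH_EOS;

    uint8_t lead = s[*ip];
    size_t need;                    /* continuation bytes expected */
    uint32_t ch;

    if (lead < 0x80)                { need = 0; ch = lead; }
    else if ((lead & 0xE0) == 0xC0) { need = 1; ch = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { need = 2; ch = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { need = 3; ch = lead & 0x07; }
    else { ++*ip; return lead; }    /* stray byte: passed through (simplification) */

    if (*ip + need >= len)
        return SKETCH_EOS;          /* truncated sequence, index not advanced */

    for (size_t k = 1; k <= need; ++k)
        ch = (ch << 6) | (uint32_t)(s[*ip + k] & 0x3F);

    *ip += need + 1;
    return ch;
}
```

A caller would typically loop until the end marker is returned, using the index before and after each call to learn how many bytes the character occupied.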
utf32_t lb_get_next_char_utf16 (const utf16_t *s, size_t len, size_t *ip)
Gets the next Unicode character in a UTF-16 sequence. The index will be advanced to the next complete character, unless the end of string is reached in the middle of a UTF-16 surrogate pair.
- Parameters:
-
[in] | s | input UTF-16 string |
[in] | len | length of the string in 16-bit code units |
[in,out] | ip | pointer to the index |
- Returns:
- the Unicode character beginning at the index; or EOS if end of input is encountered
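Surrogate pair handling with the same contract might look like the sketch below. The names sketch_next_utf16 and SKETCH_EOS are hypothetical, and the treatment of unpaired surrogates (returned unchanged) is an assumption made for illustration, not a statement about the library's behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EOS 0xFFFFFFFFu  /* hypothetical end-of-input marker */

/* Decode one character starting at s[*ip]; advance *ip past it.
 * If the input ends after a high surrogate, return SKETCH_EOS and
 * leave *ip unchanged. */
static uint32_t sketch_next_utf16(const uint16_t *s, size_t len, size_t *ip)
{
    assert(s != NULL && ip != NULL);
    if (*ip >= len)
        return SKETCH_EOS;

    uint32_t hi = s[*ip];
    if (hi < 0xD800 || hi > 0xDBFF) {   /* not a high surrogate */
        ++*ip;
        return hi;
    }
    if (*ip + 1 >= len)
        return SKETCH_EOS;              /* pair cut off by end of input */

    uint32_t lo = s[*ip + 1];
    if (lo < 0xDC00 || lo > 0xDFFF) {   /* unpaired high surrogate (assumption) */
        ++*ip;
        return hi;
    }
    *ip += 2;
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

The pair 0xD83D 0xDE00, for example, combines to U+1F600 and advances the index by two units.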
utf32_t lb_get_next_char_utf32 (const utf32_t *s, size_t len, size_t *ip)
Gets the next Unicode character in a UTF-32 sequence. The index will be advanced to the next character.
- Parameters:
-
[in] | s | input UTF-32 string |
[in] | len | length of the string in 32-bit code units |
[in,out] | ip | pointer to the index |
- Returns:
- the Unicode character beginning at the index; or EOS if end of input is encountered
void set_linebreaks (const void *s, size_t len, const char *lang, char *brks, get_next_char_t get_next_char)
Sets the line breaking information for a generic input string.
- Parameters:
-
[in] | s | input string |
[in] | len | length of the input, in the code units that get_next_char consumes |
[in] | lang | language of the input |
[out] | brks | pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR |
[in] | get_next_char | function to get the next UTF-32 character |
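The callback parameter is what lets one routine serve every encoding. The sketch below shows that design with hypothetical names throughout (the sk_ prefix, the enum values, and the callback type are all assumptions). It does not implement the UAX 14 rules: it uses a toy rule (break allowed after a space, mandatory after a newline) purely to show how per-unit output can be filled in, with every unit except the last of a multi-unit character marked as being inside a character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_EOS 0xFFFFFFFFu
/* Hypothetical stand-ins for the LINEBREAK_* results. */
enum { SK_MUSTBREAK, SK_ALLOWBREAK, SK_NOBREAK, SK_INSIDEACHAR };

/* Hypothetical callback type: decode one character, advance *ip. */
typedef uint32_t (*sk_next_char_t)(const void *s, size_t len, size_t *ip);

static uint32_t sk_next_utf32(const void *s, size_t len, size_t *ip)
{
    const uint32_t *p = s;
    return (*ip < len) ? p[(*ip)++] : SK_EOS;
}

static uint32_t sk_next_utf16(const void *s, size_t len, size_t *ip)
{
    const uint16_t *p = s;
    if (*ip >= len)
        return SK_EOS;
    uint32_t hi = p[*ip];
    if (hi >= 0xD800 && hi <= 0xDBFF) {
        if (*ip + 1 >= len)
            return SK_EOS;          /* truncated pair */
        uint32_t lo = p[*ip + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *ip += 2;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    ++*ip;
    return hi;
}

/* Encoding-agnostic driver: all decoding goes through `next`. */
static void sk_set_breaks(const void *s, size_t len, char *brks,
                          sk_next_char_t next)
{
    assert(s != NULL && brks != NULL && next != NULL);
    size_t pos = 0;
    while (pos < len) {
        size_t start = pos;
        uint32_t ch = next(s, len, &pos);
        if (ch == SK_EOS) {         /* trailing truncated units */
            for (size_t k = start; k < len; ++k)
                brks[k] = SK_INSIDEACHAR;
            break;
        }
        for (size_t k = start; k + 1 < pos; ++k)
            brks[k] = SK_INSIDEACHAR;
        /* Toy rule, NOT UAX 14: decide from the character alone. */
        brks[pos - 1] = (ch == '\n') ? SK_MUSTBREAK
                      : (ch == ' ')  ? SK_ALLOWBREAK
                                     : SK_NOBREAK;
    }
}
```

Passing sk_next_utf16 or sk_next_utf32 to the same driver is the only change needed to switch encodings, which mirrors how the encoding-specific functions documented below relate to the generic one.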
void set_linebreaks_utf8 (const utf8_t *s, size_t len, const char *lang, char *brks)
Sets the line breaking information for a UTF-8 input string.
- Parameters:
-
[in] | s | input UTF-8 string |
[in] | len | length of the input in bytes |
[in] | lang | language of the input |
[out] | brks | pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR |
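Once the breaking data has been filled in, a caller still has to turn it into line boundaries. The sketch below is one possible greedy consumer, assuming that brks holds one entry per input byte and that each entry describes the position after that byte. The sk_ names and enum values are hypothetical stand-ins for the LINEBREAK_* constants, and width is counted in bytes rather than display columns, which is a deliberate simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the LINEBREAK_* results. */
enum { SK_MUSTBREAK, SK_ALLOWBREAK, SK_NOBREAK, SK_INSIDEACHAR };

/* Return the index one past the last byte of the line that begins at
 * `start`. The line ends at a mandatory break, or at the last permitted
 * break once `width` bytes have been consumed. If no break has been
 * seen yet, the line overflows rather than splitting a character. */
static size_t sk_line_end(const char *brks, size_t len, size_t start,
                          size_t width)
{
    assert(brks != NULL && width > 0);
    size_t last_ok = start;         /* start means "no opportunity yet" */
    for (size_t i = start; i < len; ++i) {
        if (i - start >= width && last_ok > start)
            return last_ok;         /* line is full: use last opportunity */
        if (brks[i] == SK_MUSTBREAK)
            return i + 1;
        if (brks[i] == SK_ALLOWBREAK)
            last_ok = i + 1;
    }
    return len;
}
```

Calling the function repeatedly, each time starting from the previous return value, walks the text line by line until the return value equals len.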
void set_linebreaks_utf16 (const utf16_t *s, size_t len, const char *lang, char *brks)
Sets the line breaking information for a UTF-16 input string.
- Parameters:
-
[in] | s | input UTF-16 string |
[in] | len | length of the input in 16-bit code units |
[in] | lang | language of the input |
[out] | brks | pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR |
void set_linebreaks_utf32 (const utf32_t *s, size_t len, const char *lang, char *brks)
Sets the line breaking information for a UTF-32 input string.
- Parameters:
-
[in] | s | input UTF-32 string |
[in] | len | length of the input in 32-bit code units |
[in] | lang | language of the input |
[out] | brks | pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR |
int is_line_breakable (utf32_t char1, utf32_t char2, const char *lang)
Tells whether a line break can occur between two Unicode characters. This is a wrapper function that exposes a simple interface. It is generally better to use set_linebreaks_utf32 instead, because this function sees only two characters and therefore cannot correctly process complicated cases involving combining marks, spaces, etc.
- Parameters:
-
[in] | char1 | the first Unicode character |
[in] | char2 | the second Unicode character |
[in] | lang | language of the input |
- Returns:
- one of LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR
Variable Documentation
const int linebreak_version = LINEBREAK_VERSION
Version number of the library.