Line breaking rules

The position at which a line break may occur depends primarily on the language defined for the context in which the line break occurs. For example, when processing Western European languages that use white space to identify word breaks, a line can be broken by hyphenating the last word on the line, or after a user-defined word break character if the following character is permitted to start the next line. The table below lists the characters that are not permitted at the start of a line unless followed immediately by an alphabetic or numeric character:

Code Name
U+0021 EXCLAMATION MARK
U+0025 PERCENT SIGN
U+0029 RIGHT PARENTHESIS
U+002C COMMA
U+002E FULL STOP
U+003A COLON
U+003B SEMICOLON
U+003F QUESTION MARK
U+005D RIGHT SQUARE BRACKET
U+007D RIGHT CURLY BRACKET
U+00A8 DIAERESIS
U+00B0 DEGREE SIGN
U+00B7 MIDDLE DOT
U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+02C7 CARON
U+02C9 MODIFIER LETTER MACRON
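
As a rough illustration of the rule above, the following Python sketch tests whether a character may begin a line. The names `NO_START_WESTERN` and `may_start_line` are hypothetical and are not part of TopLeaf; the set is simply copied from the table.

```python
# Characters from the table above (illustrative copy, not a TopLeaf API).
NO_START_WESTERN = frozenset(
    "\u0021\u0025\u0029\u002C\u002E\u003A\u003B\u003F\u005D\u007D"
    "\u00A8\u00B0\u00B7\u00BB\u02C7\u02C9"
)

def may_start_line(text: str, index: int) -> bool:
    """Return True if text[index] may be the first character of a line."""
    ch = text[index]
    if ch not in NO_START_WESTERN:
        return True
    # A listed character is allowed only when an alphabetic or numeric
    # character follows it immediately.
    nxt = text[index + 1 : index + 2]
    return bool(nxt) and nxt.isalnum()
```

Under these assumptions, the full stop in "a .5" may start a line because a digit follows it, while the comma in "a, b" may not.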

CJK languages

When processing Chinese, Japanese, or Korean content, a line break can occur after white space, after a user-defined word break character, or after any CJK character that is permitted at the end of a line, provided that the following character is permitted to start the next line.

The following characters are not permitted at the start of a line when processing CJK content:

Code Name
U+0021 EXCLAMATION MARK
U+0025 PERCENT SIGN
U+0029 RIGHT PARENTHESIS
U+002C COMMA
U+002E FULL STOP
U+003A COLON
U+003B SEMICOLON
U+003F QUESTION MARK
U+005D RIGHT SQUARE BRACKET
U+007D RIGHT CURLY BRACKET
U+00A8 DIAERESIS
U+00B0 DEGREE SIGN
U+00B7 MIDDLE DOT
U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+02C7 CARON
U+02C9 MODIFIER LETTER MACRON
U+2010 HYPHEN
U+2013 EN DASH
U+2014 EM DASH
U+2016 DOUBLE VERTICAL LINE
U+2019 RIGHT SINGLE QUOTATION MARK
U+201D RIGHT DOUBLE QUOTATION MARK
U+2022 BULLET
U+2025 TWO DOT LEADER
U+2026 HORIZONTAL ELLIPSIS
U+2032 PRIME
U+2033 DOUBLE PRIME
U+203C DOUBLE EXCLAMATION MARK
U+2047 DOUBLE QUESTION MARK
U+2048 QUESTION EXCLAMATION MARK
U+2049 EXCLAMATION QUESTION MARK
U+2103 DEGREE CELSIUS
U+2236 RATIO
U+3001 IDEOGRAPHIC COMMA
U+3002 IDEOGRAPHIC FULL STOP
U+3003 DITTO MARK
U+3005 IDEOGRAPHIC ITERATION MARK
U+3009 RIGHT ANGLE BRACKET
U+300B RIGHT DOUBLE ANGLE BRACKET
U+300C LEFT CORNER BRACKET
U+300D RIGHT CORNER BRACKET
U+300F RIGHT WHITE CORNER BRACKET
U+3011 RIGHT BLACK LENTICULAR BRACKET
U+3015 RIGHT TORTOISE SHELL BRACKET
U+3017 RIGHT WHITE LENTICULAR BRACKET
U+3019 RIGHT WHITE TORTOISE SHELL BRACKET
U+301C WAVE DASH
U+301E DOUBLE PRIME QUOTATION MARK
U+301F LOW DOUBLE PRIME QUOTATION MARK
U+303B VERTICAL IDEOGRAPHIC ITERATION MARK
U+3041 HIRAGANA LETTER SMALL A
U+3043 HIRAGANA LETTER SMALL I
U+3045 HIRAGANA LETTER SMALL U
U+3047 HIRAGANA LETTER SMALL E
U+3049 HIRAGANA LETTER SMALL O
U+3063 HIRAGANA LETTER SMALL TU
U+3083 HIRAGANA LETTER SMALL YA
U+3087 HIRAGANA LETTER SMALL YO
U+308E HIRAGANA LETTER SMALL WA
U+3095 HIRAGANA LETTER SMALL KA
U+3096 HIRAGANA LETTER SMALL KE
U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30A1 KATAKANA LETTER SMALL A
U+30A3 KATAKANA LETTER SMALL I
U+30A5 KATAKANA LETTER SMALL U
U+30A7 KATAKANA LETTER SMALL E
U+30A9 KATAKANA LETTER SMALL O
U+30C3 KATAKANA LETTER SMALL TU
U+30E3 KATAKANA LETTER SMALL YA
U+30E5 KATAKANA LETTER SMALL YU
U+30E7 KATAKANA LETTER SMALL YO
U+30EE KATAKANA LETTER SMALL WA
U+30F5 KATAKANA LETTER SMALL KA
U+30F6 KATAKANA LETTER SMALL KE
U+30FB KATAKANA MIDDLE DOT
U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+30FD KATAKANA ITERATION MARK
U+30FE KATAKANA VOICED ITERATION MARK
U+31F0 KATAKANA LETTER SMALL KU
U+31F1 KATAKANA LETTER SMALL SI
U+31F2 KATAKANA LETTER SMALL SU
U+31F3 KATAKANA LETTER SMALL TO
U+31F4 KATAKANA LETTER SMALL NU
U+31F5 KATAKANA LETTER SMALL HA
U+31F6 KATAKANA LETTER SMALL HI
U+31F7 KATAKANA LETTER SMALL HU
U+31F8 KATAKANA LETTER SMALL HE
U+31F9 KATAKANA LETTER SMALL HO
U+31FA KATAKANA LETTER SMALL MU
U+31FB KATAKANA LETTER SMALL RA
U+31FC KATAKANA LETTER SMALL RI
U+31FD KATAKANA LETTER SMALL RU
U+31FE KATAKANA LETTER SMALL RE
U+31FF KATAKANA LETTER SMALL RO
U+FE4F WAVY LOW LINE
U+FE50 SMALL COMMA
U+FE52 SMALL FULL STOP
U+FE54 SMALL SEMICOLON
U+FE55 SMALL COLON
U+FE56 SMALL QUESTION MARK
U+FE57 SMALL EXCLAMATION MARK
U+FE5A SMALL RIGHT PARENTHESIS
U+FE5B SMALL LEFT CURLY BRACKET
U+FE5E SMALL RIGHT TORTOISE SHELL BRACKET
U+FF01 FULL-WIDTH EXCLAMATION MARK
U+FF02 FULL-WIDTH QUOTATION MARK
U+FF05 FULL-WIDTH PERCENT SIGN
U+FF07 FULL-WIDTH APOSTROPHE
U+FF09 FULL-WIDTH RIGHT PARENTHESIS
U+FF0C FULL-WIDTH COMMA
U+FF0E FULL-WIDTH FULL STOP
U+FF1A FULL-WIDTH COLON
U+FF1B FULL-WIDTH SEMICOLON
U+FF1F FULL-WIDTH QUESTION MARK
U+FF3D FULL-WIDTH RIGHT SQUARE BRACKET
U+FF40 FULL-WIDTH GRAVE ACCENT
U+FF5C FULL-WIDTH VERTICAL LINE
U+FF5D FULL-WIDTH RIGHT CURLY BRACKET
U+FF5E FULL-WIDTH TILDE
U+FF60 FULL-WIDTH RIGHT WHITE PARENTHESIS
U+FF64 HALF-WIDTH IDEOGRAPHIC COMMA
U+FFE0 FULL-WIDTH CENT SIGN
U+FFE6 FULL-WIDTH WON SIGN

The following characters are not permitted at the end of a line when processing CJK content:

Code Name
U+0028 LEFT PARENTHESIS
U+003C LESS-THAN SIGN
U+003F QUESTION MARK
U+005B LEFT SQUARE BRACKET
U+005C BACKSLASH
U+007B LEFT CURLY BRACKET
U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2018 LEFT SINGLE QUOTATION MARK
U+201C LEFT DOUBLE QUOTATION MARK
U+2035 REVERSED PRIME
U+3008 LEFT ANGLE BRACKET
U+300A LEFT DOUBLE ANGLE BRACKET
U+300C LEFT CORNER BRACKET
U+300E LEFT WHITE CORNER BRACKET
U+3010 LEFT BLACK LENTICULAR BRACKET
U+3014 LEFT TORTOISE SHELL BRACKET
U+3016 LEFT WHITE LENTICULAR BRACKET
U+3018 LEFT WHITE TORTOISE SHELL BRACKET
U+301D REVERSED DOUBLE PRIME QUOTATION MARK
U+FE59 SMALL LEFT PARENTHESIS
U+FE5B SMALL LEFT CURLY BRACKET
U+FE5D SMALL LEFT TORTOISE SHELL BRACKET
U+FF04 FULL-WIDTH DOLLAR SIGN
U+FF08 FULL-WIDTH LEFT PARENTHESIS
U+FF0E FULL-WIDTH FULL STOP
U+FF10 FULL-WIDTH ZERO
U+FF11 FULL-WIDTH ONE
U+FF12 FULL-WIDTH TWO
U+FF13 FULL-WIDTH THREE
U+FF14 FULL-WIDTH FOUR
U+FF15 FULL-WIDTH FIVE
U+FF16 FULL-WIDTH SIX
U+FF17 FULL-WIDTH SEVEN
U+FF18 FULL-WIDTH EIGHT
U+FF19 FULL-WIDTH NINE
U+FF3B FULL-WIDTH LEFT SQUARE BRACKET
U+FF5B FULL-WIDTH LEFT CURLY BRACKET
U+FF5F FULL-WIDTH LEFT WHITE PARENTHESIS
U+FFE6 FULL-WIDTH WON SIGN
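
The CJK rules above can be sketched as a test on a pair of adjacent characters. This is a minimal sketch, not TopLeaf's implementation: the function names are hypothetical, the two sets are abbreviated subsets of the tables, the code point ranges used to recognise a CJK character are an assumption, and it assumes the start-of-line restriction applies to every kind of break.

```python
# Abbreviated, illustrative subsets of the two CJK tables above.
NO_START_CJK = frozenset("\u3001\u3002\u300D\u300F\uFF09\uFF0C")
NO_END_CJK = frozenset("\u300C\u300E\uFF08\uFF3B\uFF5B")

def is_cjk(ch: str) -> bool:
    """Rough CJK test (assumption): CJK punctuation, kana, ideographs,
    Hangul syllables, and full-width forms."""
    cp = ord(ch)
    return (0x3000 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF
            or 0xAC00 <= cp <= 0xD7A3 or 0xFF00 <= cp <= 0xFFEF)

def can_break_between(before: str, after: str, break_chars: str = "") -> bool:
    """Return True if a line may end with `before` and the next start with `after`."""
    # Assumption: the following character must always be allowed to start a line.
    if after in NO_START_CJK:
        return False
    # Breaks after white space or a user-defined word break character.
    if before.isspace() or before in break_chars:
        return True
    # Otherwise only after a CJK character that may end a line.
    return is_cjk(before) and before not in NO_END_CJK
```

With these subsets, a break is allowed between two ideographs, but not before U+3002 IDEOGRAPHIC FULL STOP and not after U+300C LEFT CORNER BRACKET.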

RTL languages

The following characters are not permitted at the start of a line when processing RTL content:

Code Name
U+060C ARABIC COMMA
U+061B ARABIC SEMICOLON

Mixed script content

When processing content that consists of fragments of Western European, CJK, or RTL characters, a line break can occur after white space, after a user-defined word break character, or at a line break point determined by applying language-specific line breaking rules to the last word on the line. For example, if the last word on a line consists of a sequence of Western European characters, then Western European line breaking rules are applied to that word. If the last word on the line consists of CJK characters, then CJK line breaking rules are applied.
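
One way to picture the rule selection described above is the following Python sketch. It is illustrative only: the function names, the code point ranges, and the choice to classify a word by its final character are all assumptions rather than documented TopLeaf behaviour.

```python
def script_of(ch: str) -> str:
    """Classify a character as "cjk", "rtl", or "western" (rough assumption)."""
    cp = ord(ch)
    if (0x3000 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF
            or 0xAC00 <= cp <= 0xD7A3 or 0xFF00 <= cp <= 0xFFEF):
        return "cjk"
    if 0x0590 <= cp <= 0x06FF:  # Hebrew and Arabic blocks
        return "rtl"
    return "western"

def rules_for_last_word(line: str) -> str:
    """Pick the rule set from the final white-space-delimited word of a line."""
    words = line.split()
    if not words:
        return "western"
    # Assumption: the script of the word's last character decides.
    return script_of(words[-1][-1])
```

For a line ending in a run of Latin letters the sketch selects the Western European rules, and for a line ending in ideographs it selects the CJK rules.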

Legacy line breaking algorithm

The legacy line breaking algorithm was the default in all TopLeaf builds prior to 7.5.015. When this mode is enabled, language specific line breaking rules are not applied. Line breaks are only permitted after a white space or a user defined word break character.

If you need to continue using the legacy line breaking algorithm, declare the command:

<text-properties wordbreak="legacy" />

in your $document tag mapping.
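
To make the legacy behaviour concrete, the following Python sketch lists the positions at which a new line could start when only white space and user-defined word break characters are considered. The function name and its `break_chars` parameter are hypothetical and do not correspond to any TopLeaf API.

```python
def legacy_break_points(text: str, break_chars: str = "") -> list[int]:
    """Return the indexes at which a new line may start in legacy mode."""
    points = []
    for i in range(1, len(text)):
        prev = text[i - 1]
        # A break is allowed only after white space or a user-defined
        # word break character; no language-specific rules are consulted.
        if prev.isspace() or prev in break_chars:
            points.append(i)
    return points
```

Under this sketch, a run of CJK characters with no white space yields no break points at all, which is the main practical difference from the language-specific rules.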

URL line breaking

A restricted URL line breaking mode is enabled by declaring the command:

<text-properties wordbreak="url" />

When this mode is active, language specific line breaking rules are not applied. Line breaks are only permitted after a white space, a user defined word break character, or the following characters:

Code Name
U+0023 NUMBER SIGN
U+0026 AMPERSAND
U+002B PLUS SIGN
U+002F SLASH (SOLIDUS)
U+003B SEMICOLON
U+003D EQUALS SIGN
U+003F QUESTION MARK

Note

This mode is deprecated. To process URL content, select the default line breaking rules and declare a set of preferred URL word break characters. For example:

<text-properties break-chars="#&amp;+/;=?" />
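
The effect of such a set of break characters on a long URL can be sketched in Python as a greedy wrap. This is an illustration under stated assumptions, not TopLeaf's algorithm: `wrap_url`, `URL_BREAK_CHARS`, and the `width` parameter are hypothetical names, and the width is measured in characters rather than in typeset units.

```python
# The characters from the example above, with &amp; unescaped to "&".
URL_BREAK_CHARS = "#&+/;=?"

def wrap_url(url: str, width: int, break_chars: str = URL_BREAK_CHARS) -> list[str]:
    """Greedily split `url` into lines of at most `width` characters where possible."""
    lines, start, last_break = [], 0, None
    for i, ch in enumerate(url):
        # Adding this character would overflow: break at the last opportunity.
        if i - start >= width and last_break is not None:
            lines.append(url[start:last_break])
            start, last_break = last_break, None
        if ch in break_chars:
            last_break = i + 1  # a break is allowed after this character
    lines.append(url[start:])
    return lines
```

For example, `wrap_url("http://example.com/a/b?x=1&y=2", 20)` returns the two lines `http://example.com/` and `a/b?x=1&y=2`. A line can still exceed the width when it contains no break character.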