Sunday, October 5, 2008

Unicode Word Breaks - UCFindTextBreak

Working through the various effects, I have come quite quickly to text - so the holiday (the appearance of rapid progress) is over and it is back to text. I like text, so all is right with the world.

The text effects work on lines of text, words within lines, or letters. In the PC version of Cello there is no distinction between a letter and a glyph - however, running everything through CoreText, things are a bit different and there is this distinction to deal with.

The issue of word breaks is also more complex if you are dealing with Unicode. The PC version of Cello deals only with Roman text, and the word breaks are based on white-space. This does not hold for all languages and character sets - for example, Japanese words are not bounded by white-space, so this method of detecting word breaks (obviously) won't work in the general case.
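To make the limitation concrete, here is a minimal sketch of white-space based word detection in plain C. It is not Cello's actual code; the helper name `count_words_by_whitespace` is made up for illustration, and it assumes the text is a NUL-terminated byte string in the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from Cello): counts words by treating every
   run of non-white-space bytes as one word. */
static size_t count_words_by_whitespace(const char *text)
{
    assert(text != NULL);

    size_t count = 0;
    int in_word = 0; /* non-zero while we are inside a run of non-space bytes */

    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; ++p) {
        if (isspace(*p)) {
            in_word = 0;      /* white-space ends the current word */
        } else if (!in_word) {
            in_word = 1;      /* first byte of a new word */
            ++count;
        }
    }
    return count;
}
```

For "the quick brown fox" this finds four words. A Japanese sentence encoded as UTF-8 contains no white-space bytes at all, so the whole sentence comes back as a single "word", which is exactly the failure described above.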

A solution to the detection of word breaks is to use Apple's Unicode Utilities, which include a Text Break Locator that discovers text breaks. The format is reasonably simple: UCCreateTextBreakLocator creates a text break locator, and UCFindTextBreak will then locate a text break in a run of UTF-16 characters.

The amount of wriggling needed to get UCFindTextBreak to work with a CFString is very modest.
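Roughly, the wriggling looks like the sketch below. It is only an illustration under a few assumptions: the helper name `count_word_boundaries` is made up, the locator is created for the default locale (NULL locale, variant 0), and every call uses kUCTextBreakLeadingEdgeMask to search forward from the current offset. The Unicode Utilities live in CoreServices, so the code is guarded to build only on the Mac.

```c
#include <assert.h>

#ifdef __APPLE__ /* CoreServices (and so UCFindTextBreak) exists only on Mac OS X */
#include <CoreServices/CoreServices.h>
#include <stdlib.h>

/* Hypothetical helper: walks a CFString with UCFindTextBreak and returns
   how many word-break boundaries were found after offset 0, or -1 on error. */
static CFIndex count_word_boundaries(CFStringRef str)
{
    assert(str != NULL);

    CFIndex length = CFStringGetLength(str);
    if (length == 0) return 0;

    /* UCFindTextBreak wants a contiguous UTF-16 buffer, so copy the
       characters out of the CFString first. */
    UniChar *chars = malloc((size_t)length * sizeof(UniChar));
    if (chars == NULL) return -1;
    CFStringGetCharacters(str, CFRangeMake(0, length), chars);

    /* Assumption: NULL locale and variant 0 select the default behaviour. */
    TextBreakLocatorRef locator = NULL;
    OSStatus err = UCCreateTextBreakLocator(NULL, 0, kUCTextBreakWordMask, &locator);
    if (err != noErr) {
        free(chars);
        return -1;
    }

    CFIndex count = 0;
    UniCharArrayOffset offset = 0;
    while (offset < (UniCharArrayOffset)length) {
        UniCharArrayOffset next = 0;
        err = UCFindTextBreak(locator, kUCTextBreakWordMask,
                              kUCTextBreakLeadingEdgeMask,
                              chars, (UniCharCount)length, offset, &next);
        if (err != noErr || next <= offset) break; /* stop on error or no progress */
        offset = next;
        ++count;
    }

    UCDisposeTextBreakLocator(&locator);
    free(chars);
    return count;
}
#endif /* __APPLE__ */
```

Each offset the loop visits is a candidate word boundary; a real text effect would record the offsets rather than just count them, and would probably want to skip the runs that contain only white-space or punctuation.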
