Break a Unicode string into words with the re module or other Python tools

I am interested in breaking a Unicode string into words, which raises the cultural question of how to identify a word boundary. My understanding is that CJK languages, which are closer to one character per word than English is, do not need to intersperse spaces to mark word boundaries as English or *Greek do. The best approximation I have now is to ask re to split on non-word characters with the re.UNICODE flag. That gets most English words right (though it treats a hyphenated word as two words) and basically works for other languages that separate words with non-word characters, but it does not work for languages that need no additional character to mark a word boundary.
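
For concreteness, here is a minimal sketch of the approximation described above. The helper name split_words and the sample strings are my own illustrations, not part of any library. It collects runs of word characters, which amounts to splitting on non-word characters, and it shows both the hyphen problem and the CJK problem.

```python
import re

# \w+ matches maximal runs of Unicode word characters (letters, digits, underscore).
# In Python 3, str patterns are Unicode-aware by default, so re.UNICODE is
# redundant there; it is kept to mirror the approach described in the question.
WORD_RE = re.compile(r"\w+", re.UNICODE)

def split_words(text):
    """Hypothetical helper: return the runs of word characters found in text."""
    return WORD_RE.findall(text)

# English: the hyphenated word comes back as two words.
print(split_words("A well-known fact"))    # ['A', 'well', 'known', 'fact']

# Greek written with spaces: each word is recovered.
print(split_words("ΕΝ ΑΡΧΗ ΗΝ Ο ΛΟΓΟΣ"))   # ['ΕΝ', 'ΑΡΧΗ', 'ΗΝ', 'Ο', 'ΛΟΓΟΣ']

# Chinese without spaces: the whole sentence comes back as a single "word".
print(split_words("我喜欢编程"))             # ['我喜欢编程']
```

The last case is the failure I am asking about: every character is a word character, so nothing tells re where one word ends and the next begins.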

Are you aware of a good, correct way to get all the words from a string, or is this problem a tar-pit, where a generic and correct solution is too difficult to be practical?

  • Note on Greek: Modern Greek separates words with spaces, and in a printed Greek New Testament the words are separated by spaces, but in some old New Testament manuscripts nothing in particular marks a word boundary. I think spaces are the sort of thing communities adopt to separate words in alphabetic languages once people realize it is an option. Still, I wonder whether there are living alphabetic languages where multiple characters make up a single word but no non-word character or other marker necessarily indicates where one word ends and the next begins.