
Parse complex emojis with the tokens() function

Every emoji has a Unicode representation. But did you know that some emojis are actually Unicode combinations of other emojis? For example, the family emoji 👨‍👨‍👧‍👦 is a sequence of four person emojis held together by invisible zero-width joiner characters.

Of course, this can create problems when you're processing text in ClickHouse: a naive approach can split a combined emoji into its component parts.
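To make the idea of a "combined" emoji concrete, here is a small Python sketch (outside ClickHouse, purely for illustration) that lists the code points inside the family emoji. The helper name `describe` is made up for this example.

```python
import unicodedata

# The family emoji written as explicit code points:
# MAN, ZWJ, MAN, ZWJ, GIRL, ZWJ, BOY.
family = "\U0001F468\u200D\U0001F468\u200D\U0001F467\u200D\U0001F466"


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# One visible emoji, but several code points under the hood.
for code_point, name in describe(family):
    print(code_point, name)
```

What renders as a single glyph is really seven code points, which is why any tool that works one character at a time can pull it apart.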

Fortunately, you can use the tokens() function to extract words from text in ClickHouse while preserving combined emojis. This works even where a regex would fail:

WITH 'this is a test. And you know what that means! ❤️ 🤯 👨‍👨‍👧‍👦 #whatever @text' AS text
SELECT
    extractAll(text, '[\\p{L}\\p{N}\\p{S}]+') AS words,
    tokens(text) AS tokens
FORMAT Vertical

Query id: 9e40796f-698b-44d4-ac2c-33a9b7eb511b

Row 1:
──────
words: ['this','is','a','test','And','you','know','what','that','means','❤','🤯','👨','👨','👧','👦','whatever','text']
tokens: ['this','is','a','test','And','you','know','what','that','means','❤️','🤯','👨‍👨‍👧‍👦','whatever','text']

1 row in set. Elapsed: 0.006 sec.
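Why does the regex split the emojis while tokens() keeps them whole? The Python sketch below is one way to reason about it. It is an approximation, not ClickHouse's implementation. It assumes the regex fails because zero-width joiners and variation selectors are not letters, numbers, or symbols, and that tokens() treats only non-alphanumeric ASCII characters as separators. Both function names are hypothetical.

```python
import unicodedata


def regex_like_words(text):
    # Mimic the character class [\p{L}\p{N}\p{S}]+ : keep runs of
    # characters whose general category starts with L, N, or S.
    words, current = [], ""
    for ch in text:
        if unicodedata.category(ch)[0] in "LNS":
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def ascii_separator_tokens(text):
    # Assumed tokens()-style rule: only ASCII characters that are not
    # letters or digits split the text; non-ASCII stays in the token.
    tokens, current = [], ""
    for ch in text:
        if ch.isascii() and not ch.isalnum():
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


# Heart + variation selector, then the ZWJ-joined family emoji.
sample = ("means! \u2764\uFE0F "
          "\U0001F468\u200D\U0001F468\u200D\U0001F467\u200D\U0001F466 #whatever")

print(regex_like_words(sample))
print(ascii_separator_tokens(sample))
```

Under these assumptions, the first function breaks the family emoji into four separate people and drops the heart's variation selector, matching the words column above. The second keeps each emoji intact, matching the tokens column.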