A tale of two String lengths - and Emojis

Yesterday, while doing some coding for an OS X app in Swift, I came across an interesting issue.

For some data parsing methods I had to use regex and the infamous NSRegularExpression class. After trying several regexes in online regex testers (I owe you a thousand beers, Regex101 creator) and in Sublime Text, I found one that fit the data perfectly, so I tried it in my app.

Then two problems appeared:

  • One of the extracted data pieces was longer than expected.
  • Another piece of data was never found, although it did match the pattern.

And so the tale of two String lengths began.

Trying to be methodical, I tackled one issue at a time. First, the one with the wrong length:

Wrong ranges with NSRange and String

For some reason, the extracted text started where the regex match started, but it didn't end where the match ended. I remembered that NSTextCheckingResult just gives you a range instead of the matched string, so something must have gone wrong there. I measured the length of the extracted string and saw that there were indeed some extra characters.

But why did it only happen with this particular piece of data? A horrifying idea came to mind: F*CKING EMOJIS.

Yes, the only thing this piece of data had that the others didn't was emojis.

As an example, let's take the pattern ME: (.*?); and this string:

ME: Hey, what's up 😄😄😄?; TIME: 12:03

This would return Hey, what's up 😄😄😄?; T instead of Hey, what's up 😄😄😄?.

So 3 emojis, 3 extra chars. Everything started to make sense.
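Here is a minimal sketch of how that can happen. It uses current Swift names (firstMatch(in:options:range:), range(at:)) rather than the Swift 2 ones used elsewhere in this post, and the variable names are only for illustration. The assumption is that the NSRange coming back from the match gets treated as if it counted visible characters:

```swift
import Foundation

let text = "ME: Hey, what's up 😄😄😄?; TIME: 12:03"
let regex = try! NSRegularExpression(pattern: "ME: (.*?);", options: [])

// Search the whole string; the range length is given in UTF-16 code units.
let whole = NSRange(location: 0, length: text.utf16.count)
guard let match = regex.firstMatch(in: text, options: [], range: whole) else {
    fatalError("the pattern should match the sample string")
}

// Range of capture group 1, also measured in UTF-16 code units.
let captured = match.range(at: 1)

// Naive (wrong) use: treat location and length as Character offsets.
let start = text.index(text.startIndex, offsetBy: captured.location)
let end = text.index(start, offsetBy: captured.length)
let naive = String(text[start..<end])

print(captured.length) // 22: 19 visible characters, but each emoji takes 2 units
print(naive)           // Hey, what's up 😄😄😄?; T
```

Each emoji occupies two UTF-16 code units, so the range is three units longer than the number of visible characters, and the naive slice runs three characters too far.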

Knowing that emojis can be problematic with different character encodings, I searched the web. And apparently Stack Overflow had the answer, as it always does.

Let's look at the NSRegularExpression method I was calling, matchesInString(String, options: NSMatchingOptions, range: NSRange).

So it takes a String... and an NSRange? That looks weird. I mean, as the SO question points out, the NSString and String implementations differ quite a bit, and so do NSRange and Range.

So what if the list of NSRange given by NSTextCheckingResult can't be matched to the string you passed? Then you have at least two options:

  • Turn your String into an NSString and apply the NSRange to it. That's what I did, and it worked just fine.
  • Turn the NSRange into a Range object.

That got me the whole piece of data without any extra chars.
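Both options can be sketched like this. This is an illustration under assumptions rather than the exact code from my app: it uses current Swift names, and viaNSString and viaRange are made-up variable names. The Range(_:in:) initializer comes from Foundation and returns nil if the NSRange doesn't land on valid string indices.

```swift
import Foundation

let text = "ME: Hey, what's up 😄😄😄?; TIME: 12:03"
let regex = try! NSRegularExpression(pattern: "ME: (.*?);", options: [])
let whole = NSRange(location: 0, length: text.utf16.count)
guard let match = regex.firstMatch(in: text, options: [], range: whole) else {
    fatalError("the pattern should match the sample string")
}
let captured = match.range(at: 1)

// Option 1: go through NSString, which indexes by UTF-16 code units
// just like NSRange does.
let viaNSString = NSString(string: text).substring(with: captured)

// Option 2: convert the NSRange into a Range<String.Index> for this string.
var viaRange: String? = nil
if let range = Range(captured, in: text) {
    viaRange = String(text[range])
}

print(viaNSString)       // Hey, what's up 😄😄😄?
print(viaRange ?? "nil") // Hey, what's up 😄😄😄?
```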

One missing match

So I had tested my regex a hundred times against the missing match on several regex testers and always got the right result, yet it still wasn't working in Swift. This was starting to get ridiculous.

Once more, I thought: "Why this match and only this?"

"Well, it's the last match, just at the end of the string..."

And then everything was clear. Remember that function I mentioned above? Didn't it take an NSRange telling it which stretch of the string, from one index to another, to search for the pattern?

Apparently, while NSString.length returns the string's length in UTF-16 code units, String.characters.count returns the number of visible characters. String.utf8.count and String.unicodeScalars.count can give you different counts, too.
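To make the difference concrete, here is a small sketch that counts the same string in each view. The sample string is my own choice for illustration: an "e" followed by a combining acute accent, then one emoji. In current Swift, text.count replaces the older text.characters.count.

```swift
import Foundation

// "e" + combining acute accent (shown as one character), then one emoji.
let sample = "e\u{301}😄"

let characters = sample.count                 // visible characters
let scalars = sample.unicodeScalars.count     // Unicode code points
let utf16Units = sample.utf16.count           // UTF-16 code units
let utf8Bytes = sample.utf8.count             // UTF-8 bytes
let nsLength = NSString(string: sample).length

print(characters) // 2
print(scalars)    // 3
print(utf16Units) // 4
print(utf8Bytes)  // 7
print(nsLength)   // 4, the same as utf16.count
```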

So which one can you use with emojis? From what I've seen, only NSString.length and String.utf16.count will give you the proper length to use with that function. There may be some more, though.

When I changed the length parameter of the NSRange initializer to that, the missing match was missing no more.
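A sketch of the difference, again in current Swift. The pattern TIME: (\d{2}:\d{2}) is a hypothetical stand-in for my real one, chosen so that the match sits at the very end of the sample string:

```swift
import Foundation

let text = "ME: Hey, what's up 😄😄😄?; TIME: 12:03"
// Hypothetical pattern for a field at the end of the string.
let regex = try! NSRegularExpression(pattern: "TIME: (\\d{2}:\\d{2})", options: [])

// Too short: counts visible characters (36), so the last three
// UTF-16 code units (":03") fall outside the search range.
let tooShort = NSRange(location: 0, length: text.count)
// Correct: counts UTF-16 code units (39).
let correct = NSRange(location: 0, length: text.utf16.count)

let missed = regex.firstMatch(in: text, options: [], range: tooShort)
let found = regex.firstMatch(in: text, options: [], range: correct)

var time: String? = nil
if let found = found, let range = Range(found.range(at: 1), in: text) {
    time = String(text[range])
}

print(missed == nil) // true
print(time ?? "nil") // 12:03
```

Current Foundation can also build the full range for you with NSRange(text.startIndex..., in: text), which yields the same UTF-16 length.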

To sum up:

  • While NSTextCheckingResult returns an NSRange which gives you the match range, you can't directly apply it to your original String.
  • The NSRange passed to matchesInString(String, options: NSMatchingOptions, range: NSRange) needs a UTF-16 count to work properly when the string contains emojis, or any other character that takes more than one UTF-16 code unit.
  • I hate NSRegularExpression. And at this point, you probably do, too.