Yesterday, while doing some coding for an OS X app in Swift, I came across an interesting issue.
For some data parsing methods I had to use regex and the infamous NSRegularExpression class. After trying several regexes on online regex testers (I owe you a thousand beers, Regex101 creator) and in Sublime Text, I got one that fit the data perfectly, so I tried it in my app.
Then two problems appeared:
- One of the extracted data pieces was longer than expected.
- Another piece of data was never found, although it did match the pattern.
And so the tale of two String lengths began.
To be methodical, I tackled one issue at a time. First, the one with the wrong length:
Wrong ranges with NSRange and String
For some reason, the start of the extracted text matched the start of the regex match, but the end didn't. I remembered that NSTextCheckingResult just gives you a range instead of the matched string, so something must have gone wrong there. I measured the length of the extracted string and saw that there were indeed some extra characters.
But why did it only happen with this particular piece of data? A horrifying idea came to mind: F*CKING EMOJIS.
Yes, the only thing that this data had and the others didn't were the emojis.
For example purposes, let's take this string and this pattern
ME: Hey, what's up 😄😄😄?; TIME: 12:03
This would return "Hey, what's up 😄😄😄?; T" instead of "Hey, what's up 😄😄😄?".
So 3 emojis, 3 extra chars. Everything started to make sense.
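To make the off-by-three concrete, here is a minimal sketch of how the extra characters can sneak in. The pattern is hypothetical (the real one isn't shown here), and the calls use current Swift spellings such as matches/firstMatch(in:options:range:) rather than the older matchesInString name. It assumes the bug comes from treating the NSRange values, which count UTF-16 code units, as Character offsets.

```swift
import Foundation

let text = "ME: Hey, what's up 😄😄😄?; TIME: 12:03"

// Hypothetical pattern: capture everything between "ME: " and the first ";".
let regex = try! NSRegularExpression(pattern: "ME: (.*?);", options: [])
let whole = NSRange(location: 0, length: text.utf16.count)
let match = regex.firstMatch(in: text, options: [], range: whole)!

// The capture group's range is measured in UTF-16 code units.
let r = match.range(at: 1)

// Naive (wrong) use: treat those UTF-16 offsets as Character offsets.
let start = text.index(text.startIndex, offsetBy: r.location)
let end = text.index(start, offsetBy: r.length)
let naive = String(text[start..<end])

print(r.length) // 22 UTF-16 units, although the message is only 19 Characters
print(naive)    // Hey, what's up 😄😄😄?; T
```

Each 😄 takes two UTF-16 code units but counts as a single Character, so three emojis push the end of the slice three characters too far.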
Knowing that emojis can be problematic with different character encodings, I searched the web. And apparently Stack Overflow had the answer, as it always does.
Let's look at the NSRegularExpression API:
matchesInString(String, options: NSMatchingOptions, range: NSRange)
So it takes a String... and an NSRange? That looks weird. I mean, as said in the SO question, the NSString and String implementations differ quite a bit, and so do NSRange and Range.
So what if the list of NSRange values given by NSTextCheckingResult can't be matched to the string you passed? Then you have at least two options:
- Turn your String into an NSString and try to match the NSRange with it. That's what I did, and it worked just fine.
- Turn the NSRange into a Range object.
That got me the whole piece of data without any extra chars.
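Both options can be sketched roughly like this. Again, the pattern is a hypothetical stand-in, and the code uses current Swift API names (substring(with:) on NSString and the failable Range(_:in:) initializer), which may differ from what was available when I first hit this.

```swift
import Foundation

let text = "ME: Hey, what's up 😄😄😄?; TIME: 12:03"

// Hypothetical pattern: capture everything between "ME: " and the first ";".
let regex = try! NSRegularExpression(pattern: "ME: (.*?);", options: [])
let match = regex.firstMatch(in: text, options: [],
                             range: NSRange(location: 0, length: text.utf16.count))!
let nsRange = match.range(at: 1)

// Option 1: bridge to NSString, which is indexed in UTF-16 code units,
// exactly like the NSRange we got back.
let viaNSString = (text as NSString).substring(with: nsRange)

// Option 2: convert the NSRange into a Range<String.Index> for the Swift String.
// The initializer returns nil if the range doesn't fit the string.
var viaRange: String? = nil
if let range = Range(nsRange, in: text) {
    viaRange = String(text[range])
}

print(viaNSString)      // Hey, what's up 😄😄😄?
print(viaRange ?? "")   // Hey, what's up 😄😄😄?
```

Option 2 keeps you in Swift String land, at the cost of handling the optional.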
One missing match
I had tested my regex a hundred times against the missing match on several regex testers and always got the right result, yet it still wasn't working in Swift. This was starting to get ridiculous.
Once more, I thought: "Why this match and only this?"
"Well, it's the last match, just at the end of the string..."
And then everything was clear. Remember that function I posted above? Didn't it take an NSRange which would tell it from which index to which one to try to match the pattern?
The length I passed in that NSRange was the problem, because the different ways of measuring a string don't agree:
- NSString.length returns the length in UTF-16 code units.
- String.characters.count returns the number of visible characters.
- String.unicodeScalars.count can give you yet another count.
So which one can you use with emojis? From what I've seen, String.utf16.count (the same value as NSString.length) is the one that gives you the proper length to use with that function. There may be some more, though.
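A quick way to see the counts disagree is to print them side by side. This is only an illustrative sketch with strings I picked for the purpose; in current Swift the visible-character count is simply count, where older versions spelled it characters.count.

```swift
import Foundation

let message = "up 😄😄😄?"

let characterCount = message.count               // visible characters
let scalarCount = message.unicodeScalars.count   // Unicode scalars
let utf16Count = message.utf16.count             // UTF-16 code units
let nsLength = (message as NSString).length      // also UTF-16 code units

print(characterCount, scalarCount, utf16Count, nsLength) // 7 7 10 10

// A flag emoji is one visible character built from two scalars,
// each of which needs two UTF-16 code units.
let flag = "🇪🇸"
print(flag.count, flag.unicodeScalars.count, flag.utf16.count) // 1 2 4
```

Only the UTF-16 count lines up with the units NSRange and NSRegularExpression work in.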
When I changed the length parameter of the NSRange initializer to that, the missing match was missing no more.
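Here is a small sketch of that failure and its fix, assuming the search range was built from the visible-character count. The pattern is hypothetical and the method names are the current Swift ones.

```swift
import Foundation

let text = "ME: Hey, what's up 😄😄😄?; TIME: 12:03"

// Hypothetical pattern: grab the time at the very end of the string.
let regex = try! NSRegularExpression(pattern: "TIME: (\\d{2}:\\d{2})", options: [])

// Too short: 36 Characters, but the string is 39 UTF-16 code units long,
// so the search range stops three units before the end (right after "12").
let shortRange = NSRange(location: 0, length: text.count)
let missed = regex.firstMatch(in: text, options: [], range: shortRange)

// Correct: express the range length in UTF-16 code units.
let fullRange = NSRange(location: 0, length: text.utf16.count)
let found = regex.firstMatch(in: text, options: [], range: fullRange)

print(missed == nil) // true
if let found = found {
    print((text as NSString).substring(with: found.range(at: 1))) // 12:03
}
```

The last match is the one that suffers because the truncated range only cuts off the tail of the string.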
To sum up:
- NSTextCheckingResult gives you the match range as an NSRange, and you can't directly apply it to your original String.
- matchesInString(String, options: NSMatchingOptions, range: NSRange) needs a UTF-16 count to work properly when emojis (and some other rare or foreign characters) are present.
- I hate NSRegularExpression. And at this point, you probably do, too.