-
Notifications
You must be signed in to change notification settings - Fork 2k
Fix #4248: Unicode code point escapes #4498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #4248: Unicode code point escapes #4498
Conversation
Should this maybe be targeting the |
This is a really good start. You're really good at finding nice test cases! 👍 Here's a little problem: /\u{a}b/.test '\u{a}b' # false
///\u{a}#{'b'}///.test '\u{a}b' # true Isn't the above very surprising? Add an interpolation to your regex, and suddenly it works completely different. What's happening here is that The reason that the above two CS examples works differently is because the first one compiles to a real ES regex, while the second one compiles to a So ... would a user expect And here's the important part: Since My suggestion is to compile In the Another option would be to not flag invalid-looking Sorry for the long ramble. Here's a summary:
|
@lydell thanks for the thorough explanation, I think I followed all of your logic. I implemented your suggestion for 1.x ie rewriting code point escapes -> "normal" Unicode escapes. A couple comments on my implementation:
In what ways is this a breaking change for 1.x? Any sort of I agree that implementation-wise and CS-helpfulness-wise it's best to assume that they want a code point escape anytime you see Also, is it the case that including the |
src/lexer.coffee
Outdated
str.replace UNICODE_CODE_POINT_ESCAPE, (match, before, codePointHex, offset) => | ||
codePointDecimal = parseInt codePointHex, 16 | ||
if codePointDecimal > 1114111 | ||
@error "Unicode code point escapes greater than 1114111 are not allowed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is easier to understand for users:
unicode code point escapes greater than \u{10ffff} are not allowed
(CoffeeScript error messages also usually start with a lowercase letter)
@helixbass This looks really good!
|
1d20336
to
709a474
Compare
@lydell thanks, ok updated error message. While working on the additional changes for |
"#{toUnicodeEscape(high)}#{toUnicodeEscape(low)}" | ||
|
||
# Replace \u{...} with \uxxxx[\uxxxx] in strings and regexes | ||
replaceUnicodeCodePointEscapes: (str, options) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a bug that I hadn't tested for: adjacent code point escapes eg "\u{a}\u{b}"
were not both being processed because of the "before" section of UNICODE_CODE_POINT_ESCAPE
(( (?:^|[^\\]) (?:\\\\)* )
, which I copied from STRING_INVALID_ESCAPE
). Basically, since it looks for a non-zero-length (unless we're at the beginning of the string) verification that we're not inside an "escaped escape", the desired matches eg \u{a}
and \u{b}
become overlapping (because the second match would be }\u{b}
). I couldn't figure out how to make the regex not have to overlap so that I could still use str.replace()
so I switched to an (uglier) implementation using UNICODE_CODE_POINT_ESCAPE.exec()
and some lastIndex
tweaking
In test/regexps.coffee
and test/strings.coffee
, I copied the toJS()
helper from test/modules.coffee
so that I could add some rewriting tests on the generated JS (since code-point escapes and regular Unicode escapes seem indistinguishable to eg eq()
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not copy/paste this helper? You could maybe attach it to global
in the runTests
function in Cakefile
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GeoffreyBooth sure this was already merged so I opened a new pull request #4522 where I refactored toJS()
into test/support/helpers.coffee
src/lexer.coffee
Outdated
UNICODE_CODE_POINT_ESCAPE = /// | ||
( (?:^|[^\\]) (?:\\\\)* ) # make sure the escape isn’t escaped | ||
\\u\{ ( [\da-fA-F]+ ) \} | ||
///g |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be simpler than the .lastIndex
trickery
UNICODE_CODE_POINT_ESCAPE = ///
( \\\\ )
|
( \\u \{ [\da-fA-F]+ \} )
///g
string.replace UNICODE_CODE_POINT_ESCAPE, (match, escapedBackslash, unicodeCodePointEscape) ->
if escapedBackslash
escapedBackslash
else
doTheUnicodeStuff unicodeCodePointEscape
@lydell nice, much cleaner to be able to use |
src/lexer.coffee
Outdated
return escapedBackslash if escapedBackslash | ||
|
||
codePointDecimal = parseInt codePointHex, 16 | ||
if codePointDecimal > 1114111 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super minor thing, but I think the code would be clearer like this:
if codePointDecimal > 0x10ffff
What do you think? Doesn't really matter. After this, I'd like to merge this very nice PR! :)
src/lexer.coffee
Outdated
return escapedBackslash if escapedBackslash | ||
|
||
codePointDecimal = parseInt codePointHex, 16 | ||
if codePointDecimal > 1114111 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super minor thing, but I think the code would be clearer like this:
if codePointDecimal > 0x10ffff
What do you think? Doesn't really matter. After this, I'd like to merge this very nice PR! :)
@lydell yup clearer in hex just like in the error message. Updated here and in |
Awesome work @helixbass! Thanks! |
Nice work! |
@helixbass we generally merge |
@GeoffreyBooth sure no problem, I merged |
Hmm, maybe I should merge |
@GeoffreyBooth ya that's what I figured, just opened a pull request which just merges |
Fixes #4248
Basically uses the suggestion made by @lydell in that issue. Just changed a couple things:
}
(rather than a fixed number of characters after\u{
) for error reportingWhile adding tests, found that a new
u
regex flag is required for unicode code point escapes to be processed, so added it toVALID_FLAGS
. Not sure when Node added support for this (since it's required for these tests to pass)?