Skip to content

Fix #4248: Unicode code point escapes #4498

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Apr 20, 2017

Conversation

helixbass
Copy link
Collaborator

Fixes #4248

Basically uses the suggestion made by @lydell in that issue. Just changed a couple things:

  • According to this blog post and the spec, the number of hex digits in a Unicode code point escape isn't limited (so treat one or more hex digits as valid)
  • Because of this, try to match through a closing } (rather than a fixed number of characters after \u{) for error reporting

While adding tests, found that a new u regex flag is required for unicode code point escapes to be processed, so added it to VALID_FLAGS. Not sure when Node added support for this (since it's required for these tests to pass)?

@GeoffreyBooth
Copy link
Collaborator

Should this maybe be targeting the 2 branch?

@lydell
Copy link
Collaborator

lydell commented Apr 13, 2017

This is a really good start. You're really good at finding nice test cases! 👍

Here's a little problem:

/\u{a}b/.test '\u{a}b' # false
///\u{a}#{'b'}///.test '\u{a}b' # true

Isn't the above very surprising? Add an interpolation to your regex, and suddenly it works completely different.

What's happening here is that \u{a} is a valid sequence of characters in any ES regex, but only treated as an escape in regexes with the u flag (and in strings). In "old" regexes, \u{a} means the same thing as u\{a\}.

The reason that the above two CS examples works differently is because the first one compiles to a real ES regex, while the second one compiles to a RegExp call with string concatenation.

So ... would a user expect /\u{a}/ to contain an escape? I don't think that the average ES/CS programmer knows that you have to use the u flag in this case. So I would not be opposed to having CoffeeScript improve on ES here (after all, that's CS's job, isn't it?). CS already behaves differently regarding escapes in regexes anyway: /\ua/ (but not /\ua/u) is a valid ES regex (meaning /ua/), while CoffeeScript is helpful and points out your mistake (you probably meant /\u000a/).

And here's the important part: Since /\u{a/ would be rejected as an "invalid escape", but /\u{a}/ isn't, CS pretty much implies that the \u{a} in there is an escape, while in reality it matches the string u{a}. That's not very helpful.

My suggestion is to compile \u{XXXXX} to \uYYYY\uYYYY (and \u{XXXX} to \uXXXX). Also, throw an error for \u{110000} and up (which is an "early error" in ES; see shapesecurity/shift-parser-js#328).

In the 2 branch, we could omit the \u{XXXXX} to \uYYYY\uYYYY compilation for strings, as well as regexes with the u flag. That also has the benefit of making String.raw"\u{abcde}" and /\u{abcde}/.source being more predictable.

Another option would be to not flag invalid-looking \u{abc} escapes in non-u regexes without interpolation, or to flag even valid-looking \u{abc} escapes there ("error: \u{bcd} sure looks like an escape, but isn't here; write u{abc} instead"). I think this is neither easier to implement nor better for the user, though.

Sorry for the long ramble. Here's a summary:

  • The problem is not quite as easy as it might sound. (But my proposed solution shouldn't be terribly complicated either!)
  • CS has the potential to be very helpful here by finding programmer errors, and even improve on ES regexes a little.
  • I'm not opposed to adding this to the 1.x version, but I guess it is up to the
    implementer to decide how much work (s)he wants to put down. (In theory it could be a breaking change, but it is too small to warrant a major version bump according to the 1.x versioning scheme, I'd say.)

@helixbass
Copy link
Collaborator Author

@lydell thanks for the thorough explanation, I think I followed all of your logic. I implemented your suggestion for 1.x ie rewriting code point escapes -> "normal" Unicode escapes. A couple comments on my implementation:

  • I tried to find an existing shared place in the code to add the code point limit-checking/rewriting but didn't succeed. I tried changing validateEscapes() to validateAndRewriteEscapes() and returning the input string with rewritten escapes but that seemed to mess up the lexing position so I retreated. makeDelimitedLiteral() also looked tempting but for one thing the meaning of its existing delimiter option (the delimiter for the generated literal) is different from the delimiter I'm passing to replaceUnicodeCodePointEscapes() (the delimiter from the source, for error offsets)
  • If we're maybe going to not rewrite code point escapes on 2, not sure how much refactoring that implies since I've lumped in the limit-checking (which would still need to be done) with the rewriting. Perhaps it could just add logic inside replaceUnicodeCodePointEscapes() to check whether to preserve the code point escape syntax

In what ways is this a breaking change for 1.x? Any sort of \u{... escapes were already being flagged as invalid so this is just making it a bit more lenient?

I agree that implementation-wise and CS-helpfulness-wise it's best to assume that they want a code point escape anytime you see \u{ (and again I think existing behavior is that it would fail so there shouldn't be code out there that means u by \u). At first I thought you were going to suggest automatically adding the u flag for them but I see that rewriting makes more sense to be compatible with pre-ES6 and since setting the u flag seems to trigger a fair amount of other behavior. So then my thought on your suggestions for 2 branch is that it should be an error ("set the u regex flag if you want to use Unicode code point escapes") if they specify a regex that includes a code point escape but doesn't have the u flag set. Either way though, choosing to not rewrite them on 2 branch implies some breaking changes compared to what this would introduce into 1.x. ie for a regex with code point escapes but no u flag, in 1.x it would work (since they're rewritten) but in 2 it would either error at compile time (my suggestion) or change behavior (since the \u would pass through and behave like u). So I don't know if that breakage when upgrading to 2 outweighs the correctness (you mentioned /.../.source and String.raw()) / "nativeness" of not rewriting? Or I guess you could do something gnarlier like only rewrite (on 2) if they don't set the u flag?

Also, is it the case that including the u flag in the output from 1.x could break against a JS runtime that doesn't yet understand the u flag (ie it would consider it invalid syntax)? In which case would it make sense to strip the u flag (since we're rewriting code points)? But then that could break other u flag behavior that they were trying to use?

src/lexer.coffee Outdated
str.replace UNICODE_CODE_POINT_ESCAPE, (match, before, codePointHex, offset) =>
codePointDecimal = parseInt codePointHex, 16
if codePointDecimal > 1114111
@error "Unicode code point escapes greater than 1114111 are not allowed",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is easier to understand for users:

unicode code point escapes greater than \u{10ffff} are not allowed

(CoffeeScript error messages also usually start with a lowercase letter)

@lydell
Copy link
Collaborator

lydell commented Apr 16, 2017

@helixbass This looks really good!

  • You are right that this can't be a breaking change, since \u{ already throws an error. My bad.
  • Yes, I did mean that the \u{XXXXX}\uYYYY\uYYYY compilation can only be removed in strings and u regexes in the 2 branch.
  • We should not make \u{XXXXX} an error in non-u regexes in the 2 branch. We'll just keep the \u{XXXXX}\uYYYY\uYYYY compilation there.
  • No automatic u flag adding, because of the reasons you pointed out.
  • Finally, it should be fine adding \u{abcde} escaped and the u flag in the 1.x version, just like yield and import/export.

@helixbass helixbass force-pushed the iss4248_unicode_code_point_escapes branch from 1d20336 to 709a474 Compare April 17, 2017 16:03
@helixbass
Copy link
Collaborator Author

@lydell thanks, ok updated error message. While working on the additional changes for 2 branch, ran into a bug that I'll explain in a diff comment. So updated implementation/tests on this branch and then added the 2-specific stuff on my iss4248_unicode_code_point_escapes_on_2 branch - I can open an additional pull request now against 2 or wait until this merges, as of now all of the 2-specific changes are in a single commit

"#{toUnicodeEscape(high)}#{toUnicodeEscape(low)}"

# Replace \u{...} with \uxxxx[\uxxxx] in strings and regexes
replaceUnicodeCodePointEscapes: (str, options) ->
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a bug that I hadn't tested for: adjacent code point escapes eg "\u{a}\u{b}" were not both being processed because of the "before" section of UNICODE_CODE_POINT_ESCAPE (( (?:^|[^\\]) (?:\\\\)* ), which I copied from STRING_INVALID_ESCAPE). Basically, since it looks for a non-zero-length (unless we're at the beginning of the string) verification that we're not inside an "escaped escape", the desired matches eg \u{a} and \u{b} become overlapping (because the second match would be }\u{b}). I couldn't figure out how to make the regex not have to overlap so that I could still use str.replace() so I switched to an (uglier) implementation using UNICODE_CODE_POINT_ESCAPE.exec() and some lastIndex tweaking

In test/regexps.coffee and test/strings.coffee, I copied the toJS() helper from test/modules.coffee so that I could add some rewriting tests on the generated JS (since code-point escapes and regular Unicode escapes seem indistinguishable to eg eq())

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not copy/paste this helper? You could maybe attach it to global in the runTests function in Cakefile.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GeoffreyBooth sure this was already merged so I opened a new pull request #4522 where I refactored toJS() into test/support/helpers.coffee

src/lexer.coffee Outdated
UNICODE_CODE_POINT_ESCAPE = ///
( (?:^|[^\\]) (?:\\\\)* ) # make sure the escape isn’t escaped
\\u\{ ( [\da-fA-F]+ ) \}
///g
Copy link
Collaborator

@lydell lydell Apr 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be simpler than the .lastIndex trickery

UNICODE_CODE_POINT_ESCAPE = ///
  ( \\\\ )
  |
  ( \\u \{ [\da-fA-F]+ \} )
///g
string.replace UNICODE_CODE_POINT_ESCAPE, (match, escapedBackslash, unicodeCodePointEscape) ->
  if escapedBackslash
    escapedBackslash
  else
    doTheUnicodeStuff unicodeCodePointEscape

@helixbass
Copy link
Collaborator Author

@lydell nice, much cleaner to be able to use str.replace(), thanks! Updated per your suggestion and added a couple escaped-backslash-related tests (and merged these changes into my iss4248_unicode_code_point_escapes_on_2 2-targeted branch)

src/lexer.coffee Outdated
return escapedBackslash if escapedBackslash

codePointDecimal = parseInt codePointHex, 16
if codePointDecimal > 1114111
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor thing, but I think the code would be clearer like this:

if codePointDecimal > 0x10ffff

What do you think? Doesn't really matter. After this, I'd like to merge this very nice PR! :)

src/lexer.coffee Outdated
return escapedBackslash if escapedBackslash

codePointDecimal = parseInt codePointHex, 16
if codePointDecimal > 1114111
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor thing, but I think the code would be clearer like this:

if codePointDecimal > 0x10ffff

What do you think? Doesn't really matter. After this, I'd like to merge this very nice PR! :)

@helixbass
Copy link
Collaborator Author

@lydell yup clearer in hex just like in the error message. Updated here and in iss4248_unicode_code_point_escapes_on_2 branch. Once this is merged I'll open a pull request of that branch against 2, thanks

@lydell lydell merged commit 96b6c5f into jashkenas:master Apr 20, 2017
@lydell
Copy link
Collaborator

lydell commented Apr 20, 2017

Awesome work @helixbass! Thanks!

@GeoffreyBooth
Copy link
Collaborator

Nice work!

@GeoffreyBooth
Copy link
Collaborator

@helixbass we generally merge master into 2 and resolve conflicts, so I would suggest you take that approach for getting this into 2. Thanks!

@helixbass
Copy link
Collaborator Author

@GeoffreyBooth sure no problem, I merged master into 2 on my 2_merged_master branch, merged that into my iss4248_unicode_code_point_escapes_on_2 branch (which has the 2-specific code point stuff) and opened a pull request for that against 2. As I commented there, I can open a separate pull request for just the merge-master-into-2 part if that's cleaner? Thanks!

@GeoffreyBooth
Copy link
Collaborator

Hmm, maybe I should merge master into 2 so that your new PR’s diff is cleaner . . . but I won’t know how to resolve conflicts if there are any. Did you have any conflicts you needed to resolve when you merged master into your copy of 2?

@helixbass
Copy link
Collaborator Author

@GeoffreyBooth ya that's what I figured, just opened a pull request which just merges master into 2 (there were some conflicts). So then if you merge that, the diff on #4520 will be just my 2-specific changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unicode code point escapes
3 participants