Fix #4248: Unicode code point escapes #4498

helixbass · 2017-04-13T01:02:18Z

Basically uses the suggestion made by @lydell in that issue. Just changed a couple things:

According to this blog post and the spec, the number of hex digits in a Unicode code point escape isn't limited (so treat one or more hex digits as valid)
Because of this, try to match through a closing } (rather than a fixed number of characters after \u{) for error reporting

While adding tests, found that a new u regex flag is required for unicode code point escapes to be processed, so added it to VALID_FLAGS. Not sure when Node added support for this (since it's required for these tests to pass)?

GeoffreyBooth · 2017-04-13T01:35:34Z

Should this maybe be targeting the 2 branch?

lydell · 2017-04-13T06:29:42Z

This is a really good start. You're really good at finding nice test cases! 👍

Here's a little problem:

/\u{a}b/.test '\u{a}b' # false
///\u{a}#{'b'}///.test '\u{a}b' # true

Isn't the above very surprising? Add an interpolation to your regex, and suddenly it works completely different.

What's happening here is that \u{a} is a valid sequence of characters in any ES regex, but only treated as an escape in regexes with the u flag (and in strings). In "old" regexes, \u{a} means the same thing as u\{a\}.

The reason that the above two CS examples works differently is because the first one compiles to a real ES regex, while the second one compiles to a RegExp call with string concatenation.

So ... would a user expect /\u{a}/ to contain an escape? I don't think that the average ES/CS programmer knows that you have to use the u flag in this case. So I would not be opposed to having CoffeeScript improve on ES here (after all, that's CS's job, isn't it?). CS already behaves differently regarding escapes in regexes anyway: /\ua/ (but not /\ua/u) is a valid ES regex (meaning /ua/), while CoffeeScript is helpful and points out your mistake (you probably meant /\u000a/).

And here's the important part: Since /\u{a/ would be rejected as an "invalid escape", but /\u{a}/ isn't, CS pretty much implies that the \u{a} in there is an escape, while in reality it matches the string u{a}. That's not very helpful.

My suggestion is to compile \u{XXXXX} to \uYYYY\uYYYY (and \u{XXXX} to \uXXXX). Also, throw an error for \u{110000} and up (which is an "early error" in ES; see shapesecurity/shift-parser-js#328).

In the 2 branch, we could omit the \u{XXXXX} to \uYYYY\uYYYY compilation for strings, as well as regexes with the u flag. That also has the benefit of making String.raw"\u{abcde}" and /\u{abcde}/.source being more predictable.

Another option would be to not flag invalid-looking \u{abc} escapes in non-u regexes without interpolation, or to flag even valid-looking \u{abc} escapes there ("error: \u{bcd} sure looks like an escape, but isn't here; write u{abc} instead"). I think this is neither easier to implement nor better for the user, though.

Sorry for the long ramble. Here's a summary:

The problem is not quite as easy as it might sound. (But my proposed solution shouldn't be terribly complicated either!)
CS has the potential to be very helpful here by finding programmer errors, and even improve on ES regexes a little.
I'm not opposed to adding this to the 1.x version, but I guess it is up to the
implementer to decide how much work (s)he wants to put down. (In theory it could be a breaking change, but it is too small to warrant a major version bump according to the 1.x versioning scheme, I'd say.)

helixbass · 2017-04-15T01:58:52Z

@lydell thanks for the thorough explanation, I think I followed all of your logic. I implemented your suggestion for 1.x ie rewriting code point escapes -> "normal" Unicode escapes. A couple comments on my implementation:

I tried to find an existing shared place in the code to add the code point limit-checking/rewriting but didn't succeed. I tried changing validateEscapes() to validateAndRewriteEscapes() and returning the input string with rewritten escapes but that seemed to mess up the lexing position so I retreated. makeDelimitedLiteral() also looked tempting but for one thing the meaning of its existing delimiter option (the delimiter for the generated literal) is different from the delimiter I'm passing to replaceUnicodeCodePointEscapes() (the delimiter from the source, for error offsets)
If we're maybe going to not rewrite code point escapes on 2, not sure how much refactoring that implies since I've lumped in the limit-checking (which would still need to be done) with the rewriting. Perhaps it could just add logic inside replaceUnicodeCodePointEscapes() to check whether to preserve the code point escape syntax

In what ways is this a breaking change for 1.x? Any sort of \u{... escapes were already being flagged as invalid so this is just making it a bit more lenient?

I agree that implementation-wise and CS-helpfulness-wise it's best to assume that they want a code point escape anytime you see \u{ (and again I think existing behavior is that it would fail so there shouldn't be code out there that means u by \u). At first I thought you were going to suggest automatically adding the u flag for them but I see that rewriting makes more sense to be compatible with pre-ES6 and since setting the u flag seems to trigger a fair amount of other behavior. So then my thought on your suggestions for 2 branch is that it should be an error ("set the u regex flag if you want to use Unicode code point escapes") if they specify a regex that includes a code point escape but doesn't have the u flag set. Either way though, choosing to not rewrite them on 2 branch implies some breaking changes compared to what this would introduce into 1.x. ie for a regex with code point escapes but no u flag, in 1.x it would work (since they're rewritten) but in 2 it would either error at compile time (my suggestion) or change behavior (since the \u would pass through and behave like u). So I don't know if that breakage when upgrading to 2 outweighs the correctness (you mentioned /.../.source and String.raw()) / "nativeness" of not rewriting? Or I guess you could do something gnarlier like only rewrite (on 2) if they don't set the u flag?

Also, is it the case that including the u flag in the output from 1.x could break against a JS runtime that doesn't yet understand the u flag (ie it would consider it invalid syntax)? In which case would it make sense to strip the u flag (since we're rewriting code points)? But then that could break other u flag behavior that they were trying to use?

lydell · 2017-04-16T09:32:47Z

src/lexer.coffee

+    str.replace UNICODE_CODE_POINT_ESCAPE, (match, before, codePointHex, offset) =>
+      codePointDecimal = parseInt codePointHex, 16
+      if codePointDecimal > 1114111
+        @error "Unicode code point escapes greater than 1114111 are not allowed",


I wonder if this is easier to understand for users:

unicode code point escapes greater than \u{10ffff} are not allowed

(CoffeeScript error messages also usually start with a lowercase letter)

lydell · 2017-04-16T10:47:36Z

@helixbass This looks really good!

You are right that this can't be a breaking change, since \u{ already throws an error. My bad.
Yes, I did mean that the \u{XXXXX} → \uYYYY\uYYYY compilation can only be removed in strings and u regexes in the 2 branch.
We should not make \u{XXXXX} an error in non-u regexes in the 2 branch. We'll just keep the \u{XXXXX} → \uYYYY\uYYYY compilation there.
No automatic u flag adding, because of the reasons you pointed out.
Finally, it should be fine adding \u{abcde} escaped and the u flag in the 1.x version, just like yield and import/export.

helixbass · 2017-04-17T22:22:18Z

@lydell thanks, ok updated error message. While working on the additional changes for 2 branch, ran into a bug that I'll explain in a diff comment. So updated implementation/tests on this branch and then added the 2-specific stuff on my iss4248_unicode_code_point_escapes_on_2 branch - I can open an additional pull request now against 2 or wait until this merges, as of now all of the 2-specific changes are in a single commit

helixbass · 2017-04-17T22:31:38Z

src/lexer.coffee

+    "#{toUnicodeEscape(high)}#{toUnicodeEscape(low)}"
+
+  # Replace \u{...} with \uxxxx[\uxxxx] in strings and regexes
+  replaceUnicodeCodePointEscapes: (str, options) ->


Found a bug that I hadn't tested for: adjacent code point escapes eg "\u{a}\u{b}" were not both being processed because of the "before" section of UNICODE_CODE_POINT_ESCAPE (( (?:^|[^\\]) (?:\\\\)* ), which I copied from STRING_INVALID_ESCAPE). Basically, since it looks for a non-zero-length (unless we're at the beginning of the string) verification that we're not inside an "escaped escape", the desired matches eg \u{a} and \u{b} become overlapping (because the second match would be }\u{b}). I couldn't figure out how to make the regex not have to overlap so that I could still use str.replace() so I switched to an (uglier) implementation using UNICODE_CODE_POINT_ESCAPE.exec() and some lastIndex tweaking

In test/regexps.coffee and test/strings.coffee, I copied the toJS() helper from test/modules.coffee so that I could add some rewriting tests on the generated JS (since code-point escapes and regular Unicode escapes seem indistinguishable to eg eq())

Can we not copy/paste this helper? You could maybe attach it to global in the runTests function in Cakefile.

@GeoffreyBooth sure this was already merged so I opened a new pull request #4522 where I refactored toJS() into test/support/helpers.coffee

lydell · 2017-04-18T06:11:40Z

src/lexer.coffee

+UNICODE_CODE_POINT_ESCAPE = ///
+  ( (?:^|[^\\]) (?:\\\\)* )        # make sure the escape isn’t escaped
+  \\u\{ ( [\da-fA-F]+ ) \}
+///g


I think this would be simpler than the .lastIndex trickery

UNICODE_CODE_POINT_ESCAPE = /// ( \\\\ ) | ( \\u \{ [\da-fA-F]+ \} ) ///g string.replace UNICODE_CODE_POINT_ESCAPE, (match, escapedBackslash, unicodeCodePointEscape) -> if escapedBackslash escapedBackslash else doTheUnicodeStuff unicodeCodePointEscape

helixbass · 2017-04-18T22:34:07Z

@lydell nice, much cleaner to be able to use str.replace(), thanks! Updated per your suggestion and added a couple escaped-backslash-related tests (and merged these changes into my iss4248_unicode_code_point_escapes_on_2 2-targeted branch)

lydell · 2017-04-19T06:03:04Z

src/lexer.coffee

+      return escapedBackslash if escapedBackslash
+
+      codePointDecimal = parseInt codePointHex, 16
+      if codePointDecimal > 1114111


Super minor thing, but I think the code would be clearer like this:

if codePointDecimal > 0x10ffff

What do you think? Doesn't really matter. After this, I'd like to merge this very nice PR! :)

lydell · 2017-04-19T06:03:12Z

src/lexer.coffee

+      return escapedBackslash if escapedBackslash
+
+      codePointDecimal = parseInt codePointHex, 16
+      if codePointDecimal > 1114111


Super minor thing, but I think the code would be clearer like this:

if codePointDecimal > 0x10ffff

What do you think? Doesn't really matter. After this, I'd like to merge this very nice PR! :)

helixbass · 2017-04-19T22:26:32Z

@lydell yup clearer in hex just like in the error message. Updated here and in iss4248_unicode_code_point_escapes_on_2 branch. Once this is merged I'll open a pull request of that branch against 2, thanks

lydell · 2017-04-20T06:03:34Z

Awesome work @helixbass! Thanks!

GeoffreyBooth · 2017-04-20T06:05:23Z

Nice work!

GeoffreyBooth · 2017-04-20T06:06:16Z

@helixbass we generally merge master into 2 and resolve conflicts, so I would suggest you take that approach for getting this into 2. Thanks!

helixbass · 2017-04-20T20:44:05Z

@GeoffreyBooth sure no problem, I merged master into 2 on my 2_merged_master branch, merged that into my iss4248_unicode_code_point_escapes_on_2 branch (which has the 2-specific code point stuff) and opened a pull request for that against 2. As I commented there, I can open a separate pull request for just the merge-master-into-2 part if that's cleaner? Thanks!

GeoffreyBooth · 2017-04-20T20:58:35Z

Hmm, maybe I should merge master into 2 so that your new PR’s diff is cleaner . . . but I won’t know how to resolve conflicts if there are any. Did you have any conflicts you needed to resolve when you merged master into your copy of 2?

helixbass · 2017-04-20T21:51:37Z

@GeoffreyBooth ya that's what I figured, just opened a pull request which just merges master into 2 (there were some conflicts). So then if you merge that, the diff on #4520 will be just my 2-specific changes

Fix jashkenas#4248: Unicode code point escapes

5e16979

helixbass added 3 commits April 14, 2017 19:58

rewrite unicode code point escapes as unicode escapes

f02259e

smarter defaults

6fe8500

and resimplify

3685b76

correct surrogate pairs

49bce2a

lydell reviewed Apr 16, 2017

View reviewed changes

fixes from code review

709a474

helixbass force-pushed the iss4248_unicode_code_point_escapes branch from 1d20336 to 709a474 Compare April 17, 2017 16:03

handle adjacent code point escapes

d5affa1

helixbass commented Apr 17, 2017

View reviewed changes

lydell reviewed Apr 18, 2017

View reviewed changes

smarter regex

8f2c9b0

lydell reviewed Apr 19, 2017

View reviewed changes

fix from code review

2c73f11

lydell merged commit 96b6c5f into jashkenas:master Apr 20, 2017

helixbass mentioned this pull request Apr 20, 2017

[CS2] Unicode code point escapes on 2, fixes #4248 #4520

Merged

helixbass mentioned this pull request Apr 22, 2017

Iss4248 unicode code point escapes (cleanup) #4522

Merged

Fix #4248: Unicode code point escapes #4498

Fix #4248: Unicode code point escapes #4498

Uh oh!

Conversation

helixbass commented Apr 13, 2017

Uh oh!

GeoffreyBooth commented Apr 13, 2017

Uh oh!

lydell commented Apr 13, 2017

Uh oh!

helixbass commented Apr 15, 2017

Uh oh!

lydell Apr 16, 2017

Choose a reason for hiding this comment

Uh oh!

lydell commented Apr 16, 2017

Uh oh!

helixbass commented Apr 17, 2017

Uh oh!

helixbass Apr 17, 2017

Choose a reason for hiding this comment

Uh oh!

GeoffreyBooth Apr 22, 2017

Choose a reason for hiding this comment

Uh oh!

helixbass Apr 22, 2017

Choose a reason for hiding this comment

Uh oh!

lydell Apr 18, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

helixbass commented Apr 18, 2017

Uh oh!

lydell Apr 19, 2017

Choose a reason for hiding this comment

Uh oh!

lydell Apr 19, 2017

Choose a reason for hiding this comment

Uh oh!

helixbass commented Apr 19, 2017

Uh oh!

lydell commented Apr 20, 2017

Uh oh!

GeoffreyBooth commented Apr 20, 2017

Uh oh!

GeoffreyBooth commented Apr 20, 2017

Uh oh!

helixbass commented Apr 20, 2017

Uh oh!

GeoffreyBooth commented Apr 20, 2017

Uh oh!

helixbass commented Apr 20, 2017

Uh oh!

Uh oh!

lydell Apr 18, 2017 •

edited

Loading