-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make regexps in heuristics.yml
more portable
#6417
Make regexps in heuristics.yml
more portable
#6417
Conversation
This replaces `\R` with `\r?\n` in two places, which actually slightly changes the meaning, but I believe the regexps were using `\R` because that particular file type is common on Windows. In a third place, `\R` is replaced with `^`. This works because it follows `(?m:.*?)`, which matches newline characters.
This flag means something different in Ruby vs. other languages, including JavaScript.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. I'll merge this when I make the next release, hopefully in the next few weeks, depending on "you know what" 😉 @jorendorff.
Heads up @Alhadis as you deal with a lot of the regex queries and reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR introduces subtle incompatibilities regarding newlines:
Change
\R
to\r?\n
This should actually be (?:\r?\n|\r)
, so that CR line-endings are handled consistently between programming environments.
I also replaced
.
with(?:.|\n)
or some other equivalent formula.
(?:.|\n)
won't match carriage returns in CRLF endings, so I recommend using (?:.|[\r\n])
or (?:\s|\S)
instead.
heuristics.yml
more portable
Note that for |
One could make the same argument about LF endings on Windows (which uses CRLF). However, it's better that we enforce a consistent strategy for matching newlines in our heuristics, as it eliminates this sort of headache. |
It's different for LF endings since git replaces CRLF with LF when a file is recognized as text. - language: Gerber Image
pattern: '^[DGMT][0-9]{2}\*\r?\n' And what do we do for the cases with only - language: G-code
pattern: '^[MG][0-9]+\n' |
Yes, you're absolutely right we should. This goes for heuristics that naïvely match UPDATE: Never mind, this doesn't actually work consistently between engines after all (see @jorendorff's correction below). Leaving the below rambling as-is for transparency's sake… (?<ignore_lol>) - language: Gerber Image
- pattern: '^[DGMT][0-9]{2}\*\r?\n'
+ pattern: '^[DGMT][0-9]{2}\*$'
- language: G-code
- pattern: '^[MG][0-9]+\n'
+ pattern: '^[MG][0-9]+$' It's a different story if something else needs to be matched after the newline, though; i.e., if you're trying to match something like a YAML version header: ^%YAML 1\.2(?:\r?\n|\r)---$ |
Unfortunately, EDIT: If I'm reading this right, it doesn't work in Ruby either: irb(main):001:0> $x = "xyz\r\n"
=> "xyz\r\n"
irb(main):002:0> /xyz$/.match($x)
=> nil |
You're right. 😓 (Note to self: Stop using Firefox's console to test regular expressions…) Alright, scratch everything after the second paragraph in my last reply. Just stick with |
OK. Each place where I changed Changing all other places where |
I guess you meant |
Yes, that's right. Sorry for the error. |
No worries. Everything looks good in terms of the replacement of However, I think the second half of @Alhadis' request for changes wasn't addressed yet:
|
This affects three regexps I previously touched to remove `(?m)`. `(?:.|\r)` matches any character in most languages, but not in JS, so this commit switches to `(?:.|[\r\n])`.
Sorry, I missed that bit. Fixed now. |
Agreed. For now, let's keep this one atomic. |
@@ -675,7 +675,7 @@ disambiguations: | |||
- extensions: ['.stl'] | |||
rules: | |||
- language: STL | |||
pattern: '\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)' | |||
pattern: '\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooof!! This new regex is causing timeouts on GitHub.com. Why? Because it suffers from catastrophic backtracking with larger files... precisely where this change is as can be seen at https://regex101.com/r/2JKkxf/1 (contrived example using the sample we have)
The old regex doesn't hit this problem as it doesn't seem to be matching the same thing and determines this much sooner 😁 - https://regex101.com/r/gBFDDP/1
/cc @Alhadis @jorendorff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something obvious, but ^solid[\s\S]*endsolid$
may do the trick. If we want to keep things closer to as they currently are, \A\s*solid[\s\S]*endsolid(?:$|\s)
does the trick too with my sample and several of the customer files found tripping things up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\A\s*solid[\s\S]*^endsolid(?:$|\s)
keeps things closer to the original by requiring endsolid
to be on a new line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason the old regex doesn't match anything in regex101 is that regex101 is using PCRE, not Ruby, and the old regex uses Ruby-specific features that PCRE doesn't understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried both the old regex and the new one on a 5000-line file.
I used this Ruby script.
require 'benchmark'
LAST_GOOD_RE = /\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)/
FIRST_BAD_RE = /\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)/
text = File.read("SV05_bed.stl")
t = Benchmark.measure {
LAST_GOOD_RE.match?(text)
}
puts t.real
t = Benchmark.measure {
FIRST_BAD_RE.match?(text)
}
puts t.real
If I'm reading this right, the old regex runs in 2ms and the new one runs in 4ms, on my laptop. So it really is slower.
Mad props for simplifying regexps syntax and making it portable across the libraries! |
Description
Hi. My team is porting Linguist to Rust, and we'd like to reuse the regular expressions in heuristics.yml.
They contain a few Rubyisms—a fairly small number relative to the whole file, but enough that we have to deal with them somehow.
We'd like to upstream the changes, in the hope that other projects (like go-enry) can benefit. These changes make the regexps more compatible with languages like JavaScript, Go, Python, and Rust, without affecting the behavior.
In this PR:
{
to\{
in many rules. Curly braces are special characters in regexps. The Ruby docs actually say that they have to be escaped, but the implementation silently allows it if you don't. Other languages are stricter, though.\<
to<
in two rules.\R
to\r?\n
in two rules, and reword a third to avoid it. The meaning of\R
in Ruby includes a few more characters than\r?\n
, so this is a slight change in behavior, but it's very unlikely that any files of these particular types are using\v
or\f
, or Unicode paragraph separators, etc. as line breaks in practice.(?m)
because it's not portable — it means one thing in Ruby and a totally different thing in Python, Rust, and JavaScript. The behavior of the regexps has not been changed, though; when removing(?m)
I also replaced.
with(?:.|\n)
or some other equivalent formula. (This didn't make any regexps much longer, happily.)This doesn't make all of the regexes portable to all languages, but it's an easy big improvement.
Checklist:
N/A