Make regexps in `heuristics.yml` more portable #6417

jorendorff · 2023-05-18T22:00:30Z

Description

Hi. My team is porting Linguist to Rust, and we'd like to reuse the regular expressions in heuristics.yml.

They contain a few Rubyisms—a fairly small number relative to the whole file, but enough that we have to deal with them somehow.

We'd like to upstream the changes, in the hope that other projects (like go-enry) can benefit. These changes make the regexps more compatible with languages like JavaScript, Go, Python, and Rust, without affecting the behavior.

In this PR:

- Change `{` to `\{` in many rules. Curly braces are special characters in regexps. The Ruby docs actually say that they have to be escaped, but the implementation silently allows it if you don't. Other languages are stricter, though.
- Change `\<` to `<` in two rules.
- Change `\R` to `\r?\n` in two rules, and reword a third to avoid it. The meaning of `\R` in Ruby includes a few more characters than `\r?\n`, so this is a slight change in behavior, but it's very unlikely that any files of these particular types are using `\v` or `\f`, or Unicode paragraph separators, etc. as line breaks in practice.
- Remove uses of `(?m)` because it's not portable: it means one thing in Ruby and a totally different thing in Python, Rust, and JavaScript. The behavior of the regexps has not been changed, though; when removing `(?m)` I also replaced `.` with `(?:.|\n)` or some other equivalent formula. (This didn't make any regexps much longer, happily.)

This doesn't make all of the regexes portable to all languages, but it's an easy big improvement.
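To illustrate the `(?m)` point: in Ruby the flag lets `.` match newlines, whereas in Python, Rust, and JavaScript the same letter changes how `^` and `$` behave. Here is a minimal Ruby sketch of the rewrite. The sample string and constant names are made up for illustration and are not taken from `heuristics.yml`.

```ruby
# Hypothetical sample text, not from the Linguist samples.
SAMPLE = "begin\nmiddle\nend"

# Ruby-only spelling: (?m:...) lets `.` cross newlines.
RUBY_ONLY = /begin(?m:.*?)end/

# Rewritten without (?m): the alternation matches "\n" explicitly.
REWRITTEN = /begin(?:.|\n)*?end/

# With neither, `.` stops at "\n" in Ruby, so this cannot span lines.
PLAIN_DOT = /begin.*?end/

puts RUBY_ONLY.match?(SAMPLE)  # true
puts REWRITTEN.match?(SAMPLE)  # true
puts PLAIN_DOT.match?(SAMPLE)  # false
```

The first two agree on this input, which is the property the rewrite is meant to preserve.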

Checklist:

N/A

This replaces `\R` with `\r?\n` in two places, which actually slightly changes the meaning, but I believe the regexps were using `\R` because that particular file type is common on Windows. In a third place, `\R` is replaced with `^`. This works because it follows `(?m:.*?)`, which matches newline characters.
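A rough Ruby sketch of the "slightly changes the meaning" part, assuming a handful of candidate line breaks chosen for illustration. The constant names are hypothetical.

```ruby
# Candidate line breaks. "\v" is vertical tab, "\f" is form feed.
BREAKS = ["\n", "\r\n", "\r", "\v", "\f"]

RUBY_R  = /\A\R\z/      # Ruby-specific escape
CRLF_LF = /\A\r?\n\z/   # the replacement described above

BREAKS.each do |s|
  puts "#{s.inspect}: \\R=#{RUBY_R.match?(s)} \\r?\\n=#{CRLF_LF.match?(s)}"
end
```

Every entry matches `\R`, while only "\n" and "\r\n" match `\r?\n`. The lone "\r" case is the one that matters for portability.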

lildude

Makes sense to me. I'll merge this when I make the next release, hopefully in the next few weeks, depending on "you know what" 😉 @jorendorff.

Heads up @Alhadis as you deal with a lot of the regex queries and reviews.

Alhadis

This PR introduces subtle incompatibilities regarding newlines:

> Change \R to \r?\n

This should actually be (?:\r?\n|\r), so that CR line-endings are handled consistently between programming environments.

> I also replaced . with (?:.|\n) or some other equivalent formula.

(?:.|\n) won't match carriage returns in CRLF endings, so I recommend using (?:.|[\r\n]) or (?:\s|\S) instead.
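For reference, here is a small Ruby check of the three "any character" spellings. In Ruby itself `.` excludes only "\n", so all three accept "\r" there. The gap shows up in engines such as JavaScript, where `.` also excludes "\r". The constant names are made up for this sketch.

```ruby
# Characters an "any character" pattern should accept.
CHARS = ["a", " ", "\n", "\r"]

FORMS = {
  '(?:.|\n)'     => /\A(?:.|\n)\z/,
  '(?:.|[\r\n])' => /\A(?:.|[\r\n])\z/,
  '(?:\s|\S)'    => /\A(?:\s|\S)\z/,
}

FORMS.each do |source, re|
  misses = CHARS.reject { |c| re.match?(c) }
  puts "#{source}: misses #{misses.inspect}"
end
```

Under Ruby none of the forms miss anything, so switching to `(?:.|[\r\n])` or `(?:\s|\S)` should not change Ruby behavior. It only removes the reliance on `.` matching "\r".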

DecimalTurn · 2023-05-23T18:04:30Z

> This should actually be (?:\r?\n|\r), so that CR line-endings are handled consistently between programming environments.

Note that for INI and VB6, since they are Windows specific, we don't really have to worry about the lone CR line-ending case.

Alhadis · 2023-05-23T18:19:27Z

> Note that for INI and VB6, since they are Windows specific, we don't really have to worry about the lone CR line-ending case.

One could make the same argument about LF endings on Windows (which uses CRLF). However, it's better that we enforce a consistent strategy for matching newlines in our heuristics, as it eliminates this sort of headache.

DecimalTurn · 2023-05-23T18:49:25Z

> One could make the same argument about LF endings on Windows (which uses CRLF). However, it's better that we enforce a consistent strategy for matching newlines in our heuristics, as it eliminates this sort of headache.

It's different for LF endings since git replaces CRLF with LF when a file is recognized as text.
However, I can get behind using (?:\r?\n|\r) everywhere if our goal is to have a consistent strategy for line endings, but we should then also replace existing instances of \r?\n such as:

  - language: Gerber Image
    pattern: '^[DGMT][0-9]{2}\*\r?\n'

And what do we do for the cases with only \n like the following?

  - language: G-code
    pattern: '^[MG][0-9]+\n'
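One crude way to find the remaining candidates is to scan each pattern for a `\n` or `\r` that is not part of the `(?:\r?\n|\r)` fragment. The sketch below runs on an inline snippet rather than the real file. The third rule is hypothetical (built from a pattern quoted later in this thread), and the helper name is made up.

```ruby
require 'yaml'

# Inline stand-in for a slice of heuristics.yml; the third rule is hypothetical.
RULES_YAML = <<~'YAML'
  - language: Gerber Image
    pattern: '^[DGMT][0-9]{2}\*\r?\n'
  - language: G-code
    pattern: '^[MG][0-9]+\n'
  - language: Hypothetical YAML header rule
    pattern: '^%YAML 1\.2(?:\r?\n|\r)---$'
YAML

PORTABLE = '(?:\r?\n|\r)'

# Returns the languages whose pattern still mentions \n or \r
# outside the portable fragment. Purely textual, so it would also
# flag things like [\r\n] that may be fine.
def bare_newline_rules(rules)
  rules.select do |rule|
    rest = rule['pattern'].gsub(PORTABLE, '')
    rest.include?('\n') || rest.include?('\r')
  end.map { |rule| rule['language'] }
end

puts bare_newline_rules(YAML.safe_load(RULES_YAML))
```

This flags the Gerber Image and G-code rules and leaves the third one alone.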

Alhadis · 2023-05-23T19:05:00Z

> but we should then also replace existing instances of \r?\n

Yes, you're absolutely right we should. This goes for heuristics that naïvely match \n instead of something more platform-agnostic (I confess I'm guilty of making this mistake all the time…).

UPDATE: Never mind, this doesn't actually work consistently between engines after all (see @jorendorff's correction below). Leaving the below rambling as-is for transparency's sake…

(?<ignore_lol>)

However, if the heuristic matches a newline as the final part of a pattern, then it can be replaced with `$` instead:
  - language: Gerber Image
-    pattern: '^[DGMT][0-9]{2}\*\r?\n'
+    pattern: '^[DGMT][0-9]{2}\*$'
  - language: G-code
-    pattern: '^[MG][0-9]+\n'
+    pattern: '^[MG][0-9]+$'

It's a different story if something else needs to be matched after the newline, though; i.e., if you're trying to match something like a YAML version header:

^%YAML 1\.2(?:\r?\n|\r)---$

jorendorff · 2023-05-23T21:45:01Z

> However, if the heuristic matches a newline as the final part of a pattern, then it can be replaced with $ instead:

Unfortunately, $ does not match before \r in Python, Perl, or Rust. It works in JS, but it's not portable. :-(

EDIT: If I'm reading this right, it doesn't work in Ruby either:

irb(main):001:0> $x = "xyz\r\n"
=> "xyz\r\n"
irb(main):002:0> /xyz$/.match($x)
=> nil

Alhadis · 2023-05-23T22:18:42Z

> Unfortunately, $ does not match before \r in Python, Perl, or Rust. It works in JS, but it's not portable. :-(

You're right. 😓 (Note to self: Stop using Firefox's console to test regular expressions…)

Alright, scratch everything after the second paragraph in my last reply. Just stick with (?:\r?\n|\r) like God intended.
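As a quick Ruby illustration of the difference, here is the G-code rule written both ways and run against the three line endings. Both rewrites are hypothetical, not what is in `heuristics.yml`, and the constant names are made up.

```ruby
# Newline fragment settled on in this thread.
NEWLINE = '(?:\r?\n|\r)'

# Hypothetical rewrites of the G-code rule, for illustration only.
WITH_DOLLAR  = /^[MG][0-9]+$/
WITH_NEWLINE = /^[MG][0-9]+#{NEWLINE}/

LINES = { 'LF' => "G28\n", 'CRLF' => "G28\r\n", 'CR' => "G28\r" }

LINES.each do |name, line|
  puts "#{name}: $=#{WITH_DOLLAR.match?(line)} fragment=#{WITH_NEWLINE.match?(line)}"
end
```

In Ruby the `$` version matches only the LF input, while the fragment version matches all three.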

jorendorff · 2023-05-24T15:41:47Z

OK. Each place where I changed ~~\N~~ \R to \r?\n (only 2 patterns), I've now changed it to (?:\r?\n|\r).

Changing all other places where \n appears is a bigger change. Let me know if you want that. I think it should be a separate PR.

DecimalTurn · 2023-05-24T16:08:11Z

I guess you meant \R and not \N.

jorendorff · 2023-05-25T14:05:04Z

Yes, that's right. Sorry for the error.

DecimalTurn · 2023-05-25T15:26:04Z

No worries. Everything looks good in terms of the replacement of \R instances and I agree that we can wait for another PR to apply our new line-endings strategy for the rest of the file.

However, I think the second half of @Alhadis' request for changes wasn't addressed yet:

> > I also replaced . with (?:.|\n) or some other equivalent formula.
>
> (?:.|\n) won't match carriage returns in CRLF endings, so I recommend using (?:.|[\r\n]) or (?:\s|\S) instead.

jorendorff · 2023-05-26T18:03:36Z

Sorry, I missed that bit. Fixed now.

Alhadis · 2023-05-28T22:03:35Z

> Changing all other places where \n appears is a bigger change. […] I think it should be a separate PR.

Agreed. For now, let's keep this one atomic.

lildude · 2023-08-16T09:00:14Z

lib/linguist/heuristics.yml

@@ -675,7 +675,7 @@ disambiguations:
 - extensions: ['.stl']
  rules:
  - language: STL
-    pattern: '\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)'
+    pattern: '\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)'


Oooof!! This new regex is causing timeouts on GitHub.com. Why? Because it suffers from catastrophic backtracking on larger files, precisely at the point of this change, as can be seen at https://regex101.com/r/2JKkxf/1 (a contrived example using the sample we have).

The old regex doesn't hit this problem as it doesn't seem to be matching the same thing and determines this much sooner 😁 - https://regex101.com/r/gBFDDP/1

/cc @Alhadis @jorendorff

Maybe I'm missing something obvious, but ^solid[\s\S]*endsolid$ may do the trick. If we want to keep things closer to as they currently are, \A\s*solid[\s\S]*endsolid(?:$|\s) does the trick too with my sample and several of the customer files found tripping things up.

\A\s*solid[\s\S]*^endsolid(?:$|\s) keeps things closer to the original by requiring endsolid to be on a new line.
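A minimal Ruby sketch for checking that the patterns quoted above agree on simple inputs. The inputs are synthetic (not the sample or customer files mentioned here), the constant names are made up, and this only checks match results, not timing.

```ruby
# Patterns quoted in this thread.
OLD_RE       = /\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)/
MERGED_RE    = /\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)/
CANDIDATE_RE = /\A\s*solid[\s\S]*^endsolid(?:$|\s)/

# Synthetic ASCII STL and an unrelated text file.
FACET      = "  facet normal 0 0 1\n  endfacet\n"
STL_TEXT   = "solid cube\n" + FACET * 1_000 + "endsolid cube\n"
OTHER_TEXT = "solidity is unrelated\n" + "x\n" * 1_000

{ old: OLD_RE, merged: MERGED_RE, candidate: CANDIDATE_RE }.each do |name, re|
  puts "#{name}: stl=#{re.match?(STL_TEXT)} other=#{re.match?(OTHER_TEXT)}"
end
```

All three accept the synthetic STL and reject the other text. Note that the candidate drops the `(?=$|\s)` lookahead after `solid`, so it is slightly more permissive than the original on inputs that start with a longer word such as `solidity` and later contain an `endsolid` line.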

jorendorff · 2023-08-25

The reason the old regex doesn't match anything in regex101 is that regex101 is using PCRE, not Ruby, and the old regex uses Ruby-specific features that PCRE doesn't understand.

I tried both the old regex and the new one on a 5000-line file.

I used this Ruby script.

require 'benchmark'

LAST_GOOD_RE = /\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)/
FIRST_BAD_RE = /\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)/

text = File.read("SV05_bed.stl")

t = Benchmark.measure { LAST_GOOD_RE.match?(text) }
puts t.real
t = Benchmark.measure { FIRST_BAD_RE.match?(text) }
puts t.real

If I'm reading this right, the old regex runs in 2ms and the new one runs in 4ms, on my laptop. So it really is slower.

bzz · 2023-09-06T21:02:17Z

Mad props for simplifying regexps syntax and making it portable across the libraries!
@jorendorff you probably know all about it already, but internally we have teams using https://github.com/go-enry/rs-enry to their satisfaction.

jorendorff added 4 commits May 18, 2023 13:27

Escape curly braces in regexps, for portability.

ca99e76

Replace \< with < in regexps, for portability.

455c712

Remove some uses of (?m) from regexps, for portability.

a0a611b

This flag means something different in Ruby vs. other languages, including JavaScript.

jack2107 approved these changes May 18, 2023

lildude requested review from lildude and Alhadis May 19, 2023 07:19

lildude approved these changes May 22, 2023

Alhadis requested changes May 23, 2023

Alhadis changed the title ~~Make regexps in heuristics.yml more portable~~ Make regexps in `heuristics.yml` more portable May 23, 2023

DecimalTurn mentioned this pull request May 23, 2023

WIP add heuristic for unsupported regex syntax go-enry/go-enry#160

Closed

Change \N replacement to (?:\r?\n|\r).

bead704

Change "match any char" patterns to also match \r.

2cb1011

This affects three regexps I previously touched to remove `(?m)`. `(?:.|\r)` matches any character in most languages, but not in JS, so this commit switches to `(?:.|[\r\n])`.

Alhadis approved these changes May 28, 2023

Alhadis and others added 2 commits May 29, 2023 08:03

Merge branch 'master' into jorendorff/portable-regexps

068847d

Merge branch 'master' into jorendorff/portable-regexps

45b0ef9

lildude requested a review from a team as a code owner May 30, 2023 09:08

lildude approved these changes May 30, 2023

lildude added this pull request to the merge queue May 30, 2023

Merged via the queue into github-linguist:master with commit ae78fc7 May 30, 2023

lildude reviewed Aug 16, 2023

lildude mentioned this pull request Aug 16, 2023

Harden heuristics against Regexp::TimeoutError errors #6518

Merged

2 tasks

DecimalTurn mentioned this pull request Aug 30, 2023

Platform agnostic line endings #6530

Merged

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
