What is expected behavior when input is not valid utf8? #21
Comments
I tested this on the latest master. The output is still the same. @shurcooL: I guess what you mean is that the value for "diff text" should be the same as for "input"?
I don't know if it should, but that would be one way of resolving the issue of this behavior being undocumented and unclear (I don't know if there's an issue or not here...). The issue is that I don't know the answer to the following question:
Let's work together to define what should be done to resolve this issue, and then I will implement it. Does that sound OK to you?
I rarely encounter invalid utf8 (for me it is usually a problem with converting from one encoding to another), and this is certainly the first time I need to handle it in Go. So I will document my findings a little for people who search the net with the same problem.
First of all, the output with the � characters and the 32-byte length "could be" expected, since the characters were substituted by Go while processing the invalid values. The character is the value of
Now to the second problem, which is the "diff on a byte-level". Since the whole diff-match-patch internal code uses rune slices, we automatically lose the information about which original values the invalid code points had. People expect diffs to act on runes, i.e. splits happen at the rune level and not at the byte level, e.g. #27. So much depends on the rune handling that handling bytes and runes at once would be tricky and would lead to complex code. However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:
This would make the new DiffMainBytes handle splits on unicode code points while still holding the invalid code points in their original byte form. What do you think of these two changes? Do they make sense not just in your use case but in general? @sergi: Maybe you could take a look at this too.
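The substitution mentioned in this comment can be reproduced with the standard library alone. The following is a minimal sketch, not code from this package: `roundTripRunes` is a hypothetical helper, and it assumes the rune-based internals behave like a plain `[]rune` conversion. It shows how a single invalid byte becomes U+FFFD, which takes three bytes when encoded back to UTF-8, so the byte length grows.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// roundTripRunes is a hypothetical helper for illustration only.
// It converts s to a rune slice and back, which is roughly what any
// rune-based processing does to its input.
func roundTripRunes(s string) string {
	return string([]rune(s))
}

func main() {
	in := "a\xffb" // 3 bytes; \xff is not valid UTF-8
	out := roundTripRunes(in)

	// The invalid byte is decoded as utf8.RuneError (U+FFFD),
	// which encodes to 3 bytes, so the result is longer than the input.
	fmt.Println(utf8.ValidString(in), len(in))   // false 3
	fmt.Println(utf8.ValidString(out), len(out)) // true 5
}
```

Once the conversion has happened, the original byte value `0xff` cannot be recovered from the rune slice, which is why a byte-preserving variant would need access to the original byte slices.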
Sure. But first, let me just mention how I expected this issue to be resolved. As you mentioned, invalid UTF-8 is quite rare, and I would argue there's a good chance it's better to make it something that So, if that's the approach you choose to go with, this issue can be resolved simply by documenting that input must be valid UTF-8:
// DiffMain finds the differences between two texts.
// text1, text2 must be valid UTF-8 strings.
func (dmp *DiffMatchPatch) DiffMain(text1, text2 string, checklines bool) []Diff {
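If the documented contract is "input must be valid UTF-8", a caller can check it before diffing. This is only a sketch of that idea: `checkDiffInputs` and `errInvalidUTF8` are hypothetical names, not part of diffmatchpatch, and only `utf8.ValidString` from the standard library is used.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// checkDiffInputs is a hypothetical caller-side guard. It reports an
// error if either text contains an invalid UTF-8 sequence, so the
// caller can decide what to do before handing the texts to a diff.
func checkDiffInputs(text1, text2 string) error {
	if !utf8.ValidString(text1) || !utf8.ValidString(text2) {
		return errInvalidUTF8
	}
	return nil
}

func main() {
	fmt.Println(checkDiffInputs("hello", "héllo"))     // <nil>
	fmt.Println(checkDiffInputs("hello", "he\xffllo")) // input is not valid UTF-8
}
```

Keeping the check on the caller's side leaves DiffMain unchanged and makes the failure explicit instead of silently substituting characters.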
I'll post my thoughts below.
That makes sense to me.
Can you elaborate on how it'd be documented? I think this would be good, yeah.
Makes sense, I agree. I don't think you should try to change that.
Hmm. I think this sounds good. I'm not sure how easy the "Transform the result back to the original invalid code points using the original two byte slices" step is. But I'm not very familiar with the internal code of this package, so I'll take your word that doing this would be "rather easy". I'll mention that I personally don't have a concrete need for this now, so you should only implement this if you think it's worth it for other users/in general.
Since I am currently short on time, I will open an issue for the implementation of DiffMainBytes and will just document the behavior for now.
I tried looking at the docs for this package, but the DiffMain method simply says:
So I'm not sure how it's supposed to handle input that contains invalid utf8 sequences.
Here's how it handles it right now:
Output:
In the case where input is not valid utf8, the length of output, in bytes, is not the same as input (12 bytes vs. 32 bytes).
Is that expected behavior?
If so, is there a way I can use diffmatchpatch in such a way that it gives me a diff on a byte-level, meaning the length of output, in bytes, should match that of input (aside from pre-processing the input to not contain invalid utf8 sequences)?