Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What is expected behavior when input is not valid utf8? #21

Closed
dmitshur opened this issue Aug 22, 2015 · 6 comments
Closed

What is expected behavior when input is not valid utf8? #21

dmitshur opened this issue Aug 22, 2015 · 6 comments
Assignees

Comments

@dmitshur
Copy link

I tried looking at the docs for this package, but the DiffMain method simply says:

DiffMain finds the differences between two texts.

So I'm not sure how it's supposed to handle input that contains invalid utf8 sequences.

Here's how it handles it right now:

package main

import (
    "fmt"
    "unicode/utf8"

    "github.com/sergi/go-diff/diffmatchpatch"
)

func main() {
    var inputs = []string{
        "a1234567890z",
        "Hello 世界",
        "a\xe0\xe5\xf0\xe9\xe1\xf8\xf1\xe9\xe8\xe4Z",
    }

    for _, in := range inputs {
        fmt.Printf("input: %q\n(length %v bytes)\nutf8.Valid: %v\n", in, len(in), utf8.ValidString(in))

        dmp := diffmatchpatch.New()
        diffs := dmp.DiffMain(in, "", true)

        fmt.Printf("diff text: %q\n(length %v bytes)\n\n", diffs[0].Text, len(diffs[0].Text))
    }
}

Output:

input: "a1234567890z"
(length 12 bytes)
utf8.Valid: true
diff text: "a1234567890z"
(length 12 bytes)

input: "Hello 世界"
(length 12 bytes)
utf8.Valid: true
diff text: "Hello 世界"
(length 12 bytes)

input: "a\xe0\xe5\xf0\xe9\xe1\xf8\xf1\xe9\xe8\xe4Z"
(length 12 bytes)
utf8.Valid: false
diff text: "a����������Z"
(length 32 bytes)

In the case where input is not valid utf8, the length of output, in bytes, is not the same as input (12 bytes vs. 32 bytes).

Is that expected behavior?

If so, is there a way I can use diffmatchpatch in such a way that it gives me a diff on a byte-level, meaning the length of output, in bytes, should match that of input (aside from pre-processing the input to not contain invalid utf8 sequences)?

@zimmski
Copy link
Collaborator

zimmski commented Oct 29, 2016

I tested this on the latest master. The output is still the same.

@shurcooL: I guess what you mean is that the value for "diff text" should be the same as for "input"?

@dmitshur
Copy link
Author

I don't know if it should, but that would be one way of resolving the issue of this behavior being undocumented and unclear (I don't know if there's an issue or not here...).

The issue is that I don't know the answer to the following question:

Is that expected behavior?

@zimmski
Copy link
Collaborator

zimmski commented Nov 1, 2016

Let's work together by defining what should be done to resolve this issue, and I will implement it then. Does that sound OK to you?

I hardly encounter invalid utf8 (usually it is a transformation issue from one encoding to the other for me) but it is certainly the first time I need to handle it in Go. So I will document my findings a little for people who search the net with the same problem.

First of all, the output with the � characters and the 32 bytes length "could be" expected since the characters were substituted by Go while processing the invalid values. The character is the value of unicode/utf8.RuneError (value in bits 1111111111111101 octal 177775 hex FFFD, the character simply occupies 3 bytes in utf8) and is better know as the Unicode replacement character. Which is a substitution for invalid data if you iterate over a string rune/unicode-code-point-wise either with the range expression or with one of the functions in unicode/utf8. This is also defined in the Go spec. And since there is no error return argument for these functions, it is the only way of showing that an error occurred. In this case it is invalid data for each of this invalid code points. DiffMain is just an alias for DiffMainRunes, so we are iterating in the same sense as the range expression over both texts and diffing them. I would therefore say, that it is OK and normal, at least in Go, to substitute with unicode/utf8.RuneError since this is the spec-defined behavior for invalid utf8 unicode code points in Go. So one change for this issue would be to document this like it is documented in the unicode/utf8 package.

Now to the second problem which is the "diff on a byte-level". Since the whole diff-match-patch internal code is using rune slices we automatically loose the information what original values the invalid code points had. People are expecting that the diffs act on runes i.e. splits do not happen on byte level but on rune level e.g. #27. Too much depends on the rune handling that it would be too tricky and would lead to complex code to handle bytes and runes at once. However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:

  • Take two byte slices as parameters
  • Diff them using the internal code for runes
  • The outcome would have unicode/utf8.RuneError in it if there are invalid code points
  • Transform the result back to the original invalid code points using the original two byte slices

This would make the new DiffMainBytes handle splits on unicode code point while still holding the invalid code points in their original byte form.

What do you think of these two changes? Do they make sense not just in your use case but in general?

@sergi: Maybe you could take a look at this too.

@dmitshur
Copy link
Author

dmitshur commented Nov 6, 2016

Let's work together by defining what should be done to resolve this issue, and I will implement it then. Does that sound OK to you?

Sure. But first, let me just mentioned how I expected this issue to be resolved. As you mentioned, invalid UTF-8 is quite rare, and I would argue there's a good chance it's better to make it something that go-diff does not deal with. It can be taken care of by the user in pre-processing, before calling go-diff funcs.

So, if that's the approach you choose to go with, this issue can be resolved simply by documenting that input must be valid UTF-8:

// DiffMain finds the differences between two texts.
// text1, text2 must be valid UTF-8 strings.
func (dmp *DiffMatchPatch) DiffMain(text1, text2 string, checklines bool) []Diff {

@dmitshur
Copy link
Author

dmitshur commented Nov 6, 2016

What do you think of these two changes? Do they make sense not just in your use case but in general?

I'll post my thoughts below.

I would therefore say, that it is OK and normal, at least in Go, to substitute with unicode/utf8.RuneError since this is the spec-defined behavior for invalid utf8 unicode code points in Go.

That makes sense to me.

So one change for this issue would be to document this like it is documented in the unicode/utf8 package.

Can you elaborate on how it'd be documented? I think this would be good, yeah.

Since the whole diff-match-patch internal code is using rune slices we automatically loose the information what original values the invalid code points had. People are expecting that the diffs act on runes i.e. splits do not happen on byte level but on rune level e.g. #27. Too much depends on the rune handling that it would be too tricky and would lead to complex code to handle bytes and runes at once.

Makes sense, I agree. I don't think you should try to change that.

However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:

  • Take two byte slices as parameters
  • Diff them using the internal code for runes
  • The outcome would have unicode/utf8.RuneError in it if there are invalid code points
  • Transform the result back to the original invalid code points using the original two byte slices

This would make the new DiffMainBytes handle splits on unicode code point while still holding the invalid code points in their original byte form.

Hmm. I think this sounds good. I'm not sure how easy the "Transform the result back to the original invalid code points using the original two byte slices" step is. But I'm not very familiar with the internal code of this package, so I'll take your word on that doing this would be "rather easy".

I'll mention that I personally don't have a concrete need for this now, so you should only implement this if you think it's worth it for other users/in general.

@zimmski
Copy link
Collaborator

zimmski commented Dec 2, 2016

Since I am currently short on time I will open an issue for the implementation of DiffMainBytes and will just document the behavior for now.

@zimmski zimmski closed this as completed in dda5e6e Dec 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants