Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Image can loose the "is_section_image" status #312

Closed
kam193 opened this issue Jan 29, 2025 · 4 comments
Closed

Image can loose the "is_section_image" status #312

kam193 opened this issue Jan 29, 2025 · 4 comments
Assignees
Labels
assess We still haven't decided if this will be worked on or not bug Something isn't working

Comments

@kam193
Copy link

kam193 commented Jan 29, 2025

Describe the bug
I'm struggling with an issue that I see quite often "broken" image instead of the image previews (in all services generating previews - including document previews, images seen in PE files etc.). It doesn't happen always, and I've noticed I see it also with images that (I'm almost sure) were previously rendered correctly.

I thought that images were just gone from the datastore or so, but I've noticed HTTP 403 errors: This file is not allowed to be downloaded as a datastream. This is generated only here: https://github.com/CybercentreCanada/assemblyline-ui/blob/167c2001833fbc1b9629c57d5db248b576f398ef/assemblyline_ui/api/v4/file.py#L499 if the image doesn't have the is_section_image flag.

Digging into it, in the save_or_freshen_file method (https://github.com/CybercentreCanada/assemblyline-base/blob/be3bd33c3f2d22032c9e06aa0eafdbb1f2b56a48/assemblyline/datastore/helper.py#L1186), in the loop in L1208, most file attributes are updated to the latest state. If the newer fileinfo definition has is_section_image set to false, it will overwrite the earlier state.

I've found so far one (veeery rare) edge case when it happens: if you upload the generated preview picture for the analysis. I don't believe it's the issue I see, but it proves that something can happen.

To Reproduce
Steps to reproduce the behavior:

  1. Upload a file that would get a preview, e.g. an image.
  2. See the preview.
  3. Save the preview image and upload as a submission
  4. When the submission is done, go back to the previous one and see that the preview is gone.

Expected behavior
Old preview stays.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information if pertinent):

  • Assemblyline Version: 4.5.0.stable68, multiple services in the privileged mode (maybe important)

Additional context
This is really an edge case. I'm still looking into why it happens at scale, I have some suspicious that maybe e.g. expiration process somehow overwrites the flag, but I haven't found any real way why yet.

@kam193 kam193 added assess We still haven't decided if this will be worked on or not bug Something isn't working labels Jan 29, 2025
@cccs-nr cccs-nr self-assigned this Feb 3, 2025
@cccs-rs
Copy link
Contributor

cccs-rs commented Feb 9, 2025

I know I've seen this issue before in production (which I thought was patched) and I think it's somewhat related to: #232

Upon some cursory research on the subject, it is possible that at scale the document's state isn't consistent across workloads based on the information that it gets from Elasticsearch (especially if a lot of operations are happening to the document at a given point in time): https://stackoverflow.com/questions/25030726/elasticsearch-does-not-return-existing-document-sometimes

We could set the refresh settings to be much shorter or force a refresh upon index (refresh="wait_for") but I believe setting that for every index operation could tank performance and affect scalability for high volume workloads.

There are suggestions to use upserts or doc_as_upsert, but I'm not sure if it'll run into the same issues with respect to when the shards are refreshed: https://stackoverflow.com/questions/56831080/elasticsearch-search-sometime-fails-to-search-a-document


Upon inspecting the index settings, I noticed that we don't set anything for the refresh interval:
https://github.com/CybercentreCanada/assemblyline-base/blob/master/assemblyline/datastore/support/schemas.py#L1

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#dynamic-index-settings
"If this setting is not explicitly set, shards that haven’t seen search traffic for at least index.search.idle.after seconds will not receive background refreshes until they receive a search request. Searches that hit an idle shard where a refresh is pending will trigger a refresh as part of the search operation for that shard only."

So in theory even if the shard was idle, it should've performed a refresh because we fetch the document, if it exists, to compare the is_section_image field with the document we're about to index.. 🤔


One thing we could do at the API-level is perform a search across the Result index to see if the file in question has been identified as being a section image (on the condition that the file document indicates that it isn't a section image) and patch the document.

ie.

    if not file_obj['type'].startswith("image/"):
        return make_api_response({}, "This file is not allowed to be downloaded as a datastream.", 403)
    else:
        if not file_obj.get('is_section_image', False):
            # Are we sure this file isn't section image?
            if STORAGE.result.search(
                query=f"response.supplementary.sha256:{sha256} AND response.supplementary.is_section_image:true",
                rows=0, track_total_hits=True,
                as_obj=False)['total']:
                    # Patch file object with the correct information
                    file_obj["is_section_image"] = True
                    STORAGE.file.save(sha256, file_obj)

        # If the file has been confirmed to not be a section image at this time, block download
        if not file_obj.get('is_section_image', False):
            return make_api_response({}, "This file is not allowed to be downloaded as a datastream.", 403)

To mind, I don't think there would be an service that would drop an image as supplementary file and for that file to not be a section image in the same result record.

For example, if I wanted to check if hash2 was an image section, I believe the query above would hit on a record like this:

{
  "response": {
    "supplementary": [
     {"sha256": "hash1", "is_image_section": true},
     {"sha256": "hash2", "is_image_section": false},
    ]
  }
}

In which case, there would have to be more inspection involved for confirmation:

    if not file_obj['type'].startswith("image/"):
        return make_api_response({}, "This file is not allowed to be downloaded as a datastream.", 403)
    else:
        if not file_obj.get('is_section_image', False):
            # Are we sure this file isn't section image?
            for result in STORAGE.result.search(
                query=f"response.supplementary.sha256:{sha256} AND response.supplementary.is_section_image:true",
                fl="response.supplementary.sha256,response.supplementary.is_section_image",
                as_obj=False)['items']:

                # Confirm that the file mentioned is a section image by inspecting the results
                for file in result["response"]["supplmentary"]:
                    if file["sha256"] == sha256 and file["is_section_image"] == True:
                        # Patch file object with the correct information
                        file_obj["is_section_image"] = True
                        STORAGE.file.save(sha256, file_obj)
                        break

                if file_obj["is_section_image"] == True:
                    # File has been updated, no need to iterate over search results anymore
                    break

        # If the file has been confirmed to not be a section image at this time, block download
        if not file_obj.get('is_section_image', False):
            return make_api_response({}, "This file is not allowed to be downloaded as a datastream.", 403)

@cccs-douglass thoughts?

@gdesmar
Copy link

gdesmar commented Feb 10, 2025

To mind, I don't think there would be an service that would drop an image as supplementary file and for that file to not be a section image in the same result record.

URLDownloader is the service that comes to mind. It will add all images (non-svg) as supplementary files, but not to be used as a section image.
In the rare case where there are more than the MAX_EXTRACTED number of files in an archive, Extract will prioritize certain file types. Images are low on the list, and will have more chances to end up being added as supplementary files, which would not be used as a section image.

@cccs-rs
Copy link
Contributor

cccs-rs commented Feb 10, 2025

We'll work on a patch that states once a file has been identified as a section image, it will remain a section image forever (or until it's been expired).

@cccs-rs
Copy link
Contributor

cccs-rs commented Feb 11, 2025

This should be resolved in the 4.5.0.74 release along with services rebuilt under this release.

@cccs-rs cccs-rs closed this as completed Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
assess We still haven't decided if this will be worked on or not bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants