Use batched NSRL insertion #58

Merged
4 commits merged on Feb 7, 2025

BolidCyber
Contributor

With the current NSRL insertion implementation, the update server needs to load the entire NSRL database into memory. This can crash the service if not enough RAM is available.

With pandas, a CSV file can be read n lines at a time through a generator, which would save a lot of memory.

@BolidCyber
Contributor Author

BolidCyber commented Jan 20, 2025

Pipeline seems to be broken:

  • CybercentreCanada/assemblyline-pipeline-templates could not be found
  • CybercentreCanada/assemblyline-unittest-samples could not be found

There is no public access to those repos.

@cccs-rs cccs-rs self-assigned this Jan 20, 2025
@BolidCyber BolidCyber marked this pull request as ready for review January 20, 2025 14:54
@cccs-rs
Contributor

cccs-rs commented Jan 20, 2025

Pipeline seems to be broken:

  • CybercentreCanada/assemblyline-pipeline-templates could not be found
  • CybercentreCanada/assemblyline-unittest-samples could not be found

There is no public access to those repos.

Correct, although if the PR is ready for review, I think I should be able to trigger the pipeline manually for testing.
Based on last commit: Pipelines

@cccs-rs cccs-rs requested a review from gdesmar January 20, 2025 15:13
@cccs-rs
Contributor

cccs-rs commented Jan 20, 2025

An alternative to using pandas to lazy load CSV data: https://pypi.org/project/lazycsv?

Just in the interest of minimizing our footprint if we can 😅

@BolidCyber
Contributor Author

An alternative to using pandas to lazy load CSV data: https://pypi.org/project/lazycsv?

Just in the interest of minimizing our footprint if we can 😅

I do agree with that stance, pandas might not be the best library for this work. I'll try and make a port.

@BolidCyber BolidCyber marked this pull request as draft January 20, 2025 16:14
@BolidCyber
Contributor Author

BolidCyber commented Jan 20, 2025

An alternative to using pandas to lazy load CSV data: https://pypi.org/project/lazycsv?

Just in the interest of minimizing our footprint if we can 😅

Could we respecify the ticket?
Actually, lazy loading is already implemented through the CSV reader. There is no need for a new library (or for pandas). I think that when we analyzed the bug several months ago, we overthought it.

Reverting commit da56c7f should suffice as the issue is in the indefinitely growing hash_list variable, not the CSV read.
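
For illustration only, here is a minimal sketch of the batching idea described above, using just the standard `csv` module. It is not the code from this PR: the names `iter_hash_batches`, `insert_all`, and `insert_batch`, the `"SHA-1"` column name, and the batch size are all hypothetical. The point is that the CSV reader already yields one row at a time, so memory stays bounded as long as `hash_list` is flushed and reset after each batch instead of growing for the whole file.

```python
import csv
import io
from typing import Callable, Iterable, Iterator, List


def iter_hash_batches(
    lines: Iterable[str], column: str, batch_size: int
) -> Iterator[List[str]]:
    """Yield lists of at most batch_size hashes, reading the CSV lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # csv.DictReader pulls one line at a time from the underlying iterable,
    # so the whole file is never held in memory.
    reader = csv.DictReader(lines)
    hash_list: List[str] = []
    for row in reader:
        hash_list.append(row[column])
        if len(hash_list) >= batch_size:
            yield hash_list
            hash_list = []  # fresh list: the previous batch can be released
    if hash_list:
        yield hash_list  # final, possibly smaller, batch


def insert_all(
    lines: Iterable[str],
    insert_batch: Callable[[List[str]], None],  # hypothetical sink, e.g. a bulk insert
    column: str = "SHA-1",  # assumed column name, adjust to the real header
    batch_size: int = 1000,  # assumed value, not taken from the PR
) -> int:
    """Send every batch to insert_batch and return the number of hashes seen."""
    total = 0
    for batch in iter_hash_batches(lines, column, batch_size):
        insert_batch(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    sample = io.StringIO(
        '"SHA-1","FileName"\n"AA","a.txt"\n"BB","b.txt"\n"CC","c.txt"\n'
    )
    received: List[List[str]] = []
    count = insert_all(sample, received.append, batch_size=2)
    print(count, received)
```

In a real run, `lines` would be an open file handle rather than a `StringIO`, and `insert_batch` would wrap whatever bulk insertion call the update server actually uses.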

@BolidCyber BolidCyber changed the title Use pandas for batched NSRL insertion Use batched NSRL insertion Jan 20, 2025
@BolidCyber BolidCyber marked this pull request as ready for review January 20, 2025 16:39
@cccs-rs cccs-rs requested a review from gdesmar January 21, 2025 15:24
@cccs-rs cccs-rs merged commit a84feab into CybercentreCanada:master Feb 7, 2025
1 check failed
@cccs-rs
Contributor

cccs-rs commented Feb 7, 2025

Patch should be included in the latest release! 🚀

3 participants