This repository was archived by the owner on Mar 8, 2024. It is now read-only.
Request Type
Bug
Problem Description
The index engine fails to process a document if it contains a "non full-text" field larger than 32766 bytes.
During document creation, the document is not indexed and becomes invisible (even though it is stored in the database).
During a data reindex, the process stops and part of the data is left unindexed.
A single oversized field can therefore break the application.
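The 32766-byte limit applies to the UTF-8 encoding of the term, not to its character count, which is why the thresholds below are expressed in characters with a worst case of 4 bytes each. A minimal Python sketch of the check (the constant and function name are illustrative, not part of any real API):

```python
# Lucene rejects any single indexed term whose UTF-8 encoding exceeds 32766 bytes.
LUCENE_MAX_TERM_BYTES = 32766

def exceeds_index_limit(value: str) -> bool:
    """Return True if indexing this field value would raise an 'immense term' error."""
    return len(value.encode("utf-8")) > LUCENE_MAX_TERM_BYTES

# 8192 four-byte characters (e.g. emoji) already exceed the limit,
# even though the character count looks small.
print(exceeds_index_limit("\U0001F600" * 8192))  # → True (8192 × 4 = 32768 bytes)
```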
Solution
During database initialisation, add a process that finds immense terms and fixes them. Several strategies can be applied:
truncate: truncate the data
delete: remove the document
log: show the document in logs
A custom strategy (for example, storing the data in a file storage) could also be considered, but it cannot be implemented in Scalligraph.
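The three strategies above can be sketched as follows. This is a hypothetical illustration in Python; the function name, signature, and dict-based document are assumptions, not Scalligraph's actual Scala API:

```python
import logging

def apply_strategy(doc: dict, field: str, strategy: str, threshold: int = 8191):
    """Apply an immense-term repair strategy to a document (returns None to delete it)."""
    value = doc.get(field)
    if value is None or len(value) <= threshold:
        return doc                      # nothing to fix
    if strategy == "truncate":
        doc[field] = value[:threshold]  # keep only the first `threshold` characters
        return doc
    if strategy == "delete":
        return None                     # drop the whole document
    if strategy == "log":
        logging.warning("immense term in field %r: %d chars", field, len(value))
        return doc                      # leave the document untouched
    raise ValueError(f"unknown strategy: {strategy}")
```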
The process requires a full scan of the database (because the index cannot be used). It is triggered only if the configuration is present. The configuration consists of a field name and the strategy to apply to it. The strategy can take an optional parameter that defines the size threshold in characters (a character may occupy up to 4 bytes in UTF-8).
db.janusgraph {
  immenseTermProcessing: {
    data: "delete(2048)"    // Delete documents whose field "data" is larger than 2048 characters
    title: "truncate(4096)" // Truncate the field "title" to 4096 characters
    name: "truncate"        // Truncate the field "name" (default threshold is 8191)
  }
}
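A strategy spec such as "delete(2048)" or a bare "truncate" could be parsed as sketched below. This parser is hypothetical; only the strategy names, the spec syntax, and the default threshold of 8191 characters (8191 × 4 = 32764 bytes, just under the 32766-byte limit) come from the configuration above:

```python
import re

def parse_strategy(spec: str, default_threshold: int = 8191):
    """Parse a spec like 'delete(2048)' into a (name, threshold) pair."""
    m = re.fullmatch(r"(truncate|delete|log)(?:\((\d+)\))?", spec.strip())
    if not m:
        raise ValueError(f"invalid strategy spec: {spec}")
    name, threshold = m.group(1), m.group(2)
    return name, int(threshold) if threshold else default_threshold

print(parse_strategy("delete(2048)"))  # → ('delete', 2048)
print(parse_strategy("truncate"))      # → ('truncate', 8191)
```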
IMPORTANT The configuration should be present for only one startup, to fix the data. It should be removed as soon as the process is finished.