
TheHive4 with Cassandra very slow compared to TH3.4 on the same data #1341

Closed
martinr103 opened this issue May 18, 2020 · 17 comments

@martinr103

Bug?

Work Environment

| Question | Answer |
|---|---|
| OS version (server) | Debian |
| TheHive version / git hash | 4.0RC2 |
| Package Type | Binary |
| Browser type & version | Firefox 76 |

Problem Description

I observe a massive performance degradation with TH4/Cassandra, compared to TH3.4/Elastic.
I would like to ask here if other people have similar experiences, or whether I am "alone" with this issue.

We run the test instance of Cassandra and TH4 as Docker containers; the data has been imported from the production environment (TH3.4) following the provided migration documentation.
There are a little fewer than 5000 cases, and each case has a "handful" of observables (1, 2, 5...) -- none of the cases has more than 20.
So all in all, this is in my opinion a rather small amount of data for a database to handle.

In fact, our production instance (TH3.4) runs and "feels" very fast in the UI. Loading the list of all cases (or, via quick filters, "only closed cases"... "only open cases") is instant. No delay whatsoever.
Opening a single case, checking its details, and checking its observables is also instant.

Compared to this, the TH4 UI "feels" terribly slow. Loading the list of cases takes 3+ seconds.
Applying any filter on the case list takes just as long.
Opening a single case's details and switching to the Observables tab takes "ages" (5-6 or more seconds) before the UI actually displays the observables.
This makes TH4 effectively unusable as it currently is.

Has anybody observed similar slowness?
Or is it super-fast for most users? (That would indicate that something is pretty wrong in our local setup.)

Complementary information

Here are a few lines of the TheHive log (filtered to show how long the searches take):

(look at the case/_search and case/artifact/_search numbers)

[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/task/log/_search?range=0-100&nparent=1 took 4305ms and returned 200 2 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/task/_stats took 9293ms and returned 200 11 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_stats took 13057ms and returned 200 28 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_stats took 12622ms and returned 200 92 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_stats took 12626ms and returned 200 28 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_stats took 12634ms and returned 200 31 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_search?range=all&sort=-startDate&nstats=true took 12713ms and returned 200 645 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/stream took 3ms and returned 200 20 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/v0/query took 402ms and returned 200 125878 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/alert/_stats took 2798ms and returned 200 40 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/_stats took 2947ms and returned 200 703 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/_stats took 2973ms and returned 200 61 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/_search?range=0-15&sort=-flag&sort=-startDate&nstats=true took 4621ms and returned 200 31861 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/_stats took 2159ms and returned 200 14 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/task/_stats took 5206ms and returned 200 39 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/task/_stats took 5671ms and returned 200 11 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/_stats took 2832ms and returned 200 61 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/_search?range=0-15&sort=-flag&sort=-startDate&nstats=true took 3062ms and returned 200 31358 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/_search?range=0-15&sort=-flag&sort=-startDate&nstats=true took 3133ms and returned 200 30828 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/connector/cortex/action/_search?range=0-100&sort=-startDate took 13ms and returned 200 2 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/alert/_stats took 1252ms and returned 200 31 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/task/log/_search?range=0-100&nparent=1 took 4100ms and returned 200 2 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/task/_stats took 8976ms and returned 200 11 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_stats took 12228ms and returned 200 28 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_stats took 12120ms and returned 200 61 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_stats took 12099ms and returned 200 30 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_stats took 12101ms and returned 200 31 bytes
[info] o.t.s.AccessLogFilter - 10.91.128.2 POST /api/case/artifact/_search?range=all&sort=-startDate&nstats=true took 12153ms and returned 200 317 bytes
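
(Such lines can be pulled out with a simple filter, for example something like the following, assuming the default log file location -- with the Docker image the same lines appear on the container's console output:)

# keep only the access-log lines, which include how long each request took
grep "AccessLogFilter" /var/log/thehive/application.log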

martinr103 added the TheHive4 (TheHive4 related issues) and bug labels May 18, 2020
@crackytsi

I have the same experience on my single-host system, but I'm still feeding data into TheHive, so I wasn't sure if this is the root cause.
Additionally, I can tell you that I doubled the CPU cores to ensure better performance.

@crackytsi

On a single-node cluster with 32 GB RAM and 2 CPUs at 2.2 GHz it takes about 6 seconds to mark an alert as unread. The CPU is running 2 processes at 100% :-(

@martinr103
Author

Oh yes, now I have also checked the CPU load.
(This is, by the way, running on a machine with 32 virtual CPU cores, 16 physical.)

During a UI operation in TH4, both Java processes (thehive and cassandra) go as high as 300%-400% CPU usage, meaning each of them consumes about 3-4 CPU cores at 100%. This goes on for a few seconds, until the data has finished loading.

By comparison, the Elasticsearch Java process on the prod instance rarely goes over 15-30% CPU during normal TheHive UI operations.

@To-om
Contributor

To-om commented May 29, 2020

TheHive 4 is slower than TheHive 3: it must manage the relationships between entities and must check permissions and visibility. Moreover, the database is now transactional. But it should be only slightly slower, not as slow as you describe.
Can you provide logs related to a simple operation, for example marking an alert as read?
You can activate full logs by adding a logger in /etc/thehive/logback.xml:

<?xml version="1.0" encoding="UTF-8"?>
<configuration debug="false">
    <!-- [...] -->

    <logger name="org.thp" level="TRACE"/> <!-- add this line -->

    <root level="INFO">
        <appender-ref ref="ASYNCFILE"/>
        <appender-ref ref="ASYNCSTDOUT"/>
    </root>
</configuration>
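
Then restart TheHive so the new level is taken into account, reproduce the slow operation, and collect the log. A minimal sketch, assuming a package-based install with the default log location (with the Docker image, restart the container instead):

# restart TheHive so the TRACE level is applied (systemd-based install assumed)
sudo systemctl restart thehive

# follow the application log while reproducing the slow operation (e.g. marking an alert as read)
tail -f /var/log/thehive/application.log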

To-om self-assigned this May 29, 2020
@martinr103
Author

I have enabled the TRACE level.
Where can I post the (lengthy) log output? Just paste it here in this issue thread?

@To-om
Contributor

To-om commented Jun 3, 2020

@martinr103 you can create a gist

@martinr103
Author

Oh, I just found the hint at the bottom of this message box - that you can actually drag & drop files into it to create attachments. I should have seen it earlier :)

Ok, so attached is the requested log from the operation "mark alert as read". I haven't measured the time, but it took several seconds from the click on the envelope icon until the screen refreshed.

th4_trace.txt

@To-om
Contributor

To-om commented Jun 3, 2020

These logs come from the console, not from the log file (/var/log/thehive/application.log). They don't contain timestamps. Nevertheless, I can see:

POST /api/alert/40968440/markAsRead took 38ms and returned 204 0 bytes

38ms is not long.
But the search and stats queries are long:

POST /api/alert/_search?range=0-15&sort=-date took 2903ms
POST /api/alert/_stats took 1753ms and returned 200 37 bytes
POST /api/alert/_stats took 1811ms and returned 200 384 bytes
POST /api/alert/_stats took 2623ms and returned 200 62 bytes
POST /api/alert/_stats took 2633ms and returned 200 62 bytes

There is heavy processing behind these requests because the user permissions must be checked for each alert and because they still use the TheHive3-compatible query format. I'll try to optimize them.
How many alerts do you have?

@martinr103
Author

Ah yes, sorry, below are the same type of logs, but now with timestamps. Maybe it helps a little more.

We have ~4700 alerts in this database.

Basically I can confirm... yes, all the searches take long (search for all open cases, search for all closed cases, search for alerts, search for the artifacts of one case, etc.)

th4_trace_with_times.txt

@To-om
Contributor

To-om commented Jun 8, 2020

I'd like to check whether the TheHive3 query format is the real reason for the performance problem. Can you execute the following command:

time curl http://${THEHIVE_URL}/api/v1/query \
  -s -u ${LOGIN}:${PASSWORD} \
  -H "Content-type: application/json" \
  -d '
{
  "query": [
    {"_name": "listAlert"},
    {"_name": "page", "from": 0, "to": 15}
  ]
}'

Replace ${LOGIN}, ${PASSWORD} and ${THEHIVE_URL} by the correct values.
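
For example, the placeholders can be exported as environment variables first (the values below are only placeholders; use your own instance and account):

export THEHIVE_URL="localhost:9000"     # host:port of the TheHive 4 instance
export LOGIN="admin@thehive.local"      # an account allowed to list alerts
export PASSWORD="secret"                # its password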

@martinr103
Author

martinr103 commented Jun 9, 2020

Getting an error from the described command:

{"type":"AttributeCheckingError","message":"[Attribute withStats is missing]","errors":[{"name":"withStats","type":"MissingAttributeError"}]}
real    0m3.253s
user    0m0.001s
sys     0m0.004s

@crackytsi
Copy link

crackytsi commented Jun 10, 2020

Here are some results from my installation, for comparison.
I have currently imported 2900 of 6000 cases and 16 alerts (the import has been running for 5 days), on a system with 2 cores (4 logical with hyper-threading) at 2.2 GHz:

#_Alerts_

time curl http://x.x.x.x:9000/api/v1/query   -s -u user:password   -H "Content-type: application/json"   -d '
{
  "query": [
    {"_name": "listAlert"},
    {"_name": "page", "from": 0, "to": 15}
  ]
}'
real    0m0.402s
user    0m0.008s
sys     0m0.008s


time curl "http://x.x.x.x:9000/api/alert/_search?range=0-15&sort=-date" -s -u user:password -H "Content-type: application/json"   -d '
{"query":{"_and":[{}]}
}'
real    0m0.958s
user    0m0.012s
sys     0m0.008s


#_Cases_

# time curl http://x.x.x.x:9000/api/v1/query   -s -u user:password -H "Content-type: application/json"   -d '
{
  "query": [
    {"_name": "listCase"},
    {"_name": "page", "from": 0, "to": 15}
  ]
}'
real    0m3.431s
user    0m0.016s
sys     0m0.000s



time curl "http://x.x.x.x:9000/api/case/_search?range=0-15&sort=-flag&sort=-startDate&nstats=true" -s -u user:password -H "Content-type: application/json"   -d '{
"query":{"_and":[{}]}
}'
real    0m7.673s
user    0m0.016s
sys     0m0.004s

@mamoedo
Copy link

mamoedo commented Jun 10, 2020

I have similar results. I've been running the migration since 20 May and it's still running (after several resumes due to errors).

time curl -k https://x.x.x.x:9000/api/v1/query   -s -u user:password   -H "Content-type: application/json"   -d '
{
  "query": [
    {"_name": "listAlert"},
    {"_name": "page", "from": 0, "to": 15}
  ]
}'
real      0m 0.37s
user    0m 0.01s
sys     0m 0.01s

 


time curl -k "https://x.x.x.x:9000/api/alert/_search?range=0-15&sort=-date"
real    0m 5.67s
user    0m 0.00s
sys     0m 0.01s

@To-om
Contributor

To-om commented Jun 10, 2020

@martinr103 the attribute withStats has been made optional in the latest version of TheHive. You can add it like this: {"_name": "page", "from": 0, "to": 15, "withStats": false}

@crackytsi @mamoedo thank you. There is an important difference.
The TheHive3 query also sorts the results. For a correct comparison, a sort should be added to the TheHive4 query:

"query": [
    {"_name": "listAlert"},
    {"_name": "sort", "_fields": [{"date": "asc"}]},
    {"_name": "page", "from": 0, "to": 15, "withStats": false}
  ]
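
Put together with the earlier curl invocation (same ${THEHIVE_URL}, ${LOGIN} and ${PASSWORD} placeholders), the comparison query would then look like this:

time curl http://${THEHIVE_URL}/api/v1/query \
  -s -u ${LOGIN}:${PASSWORD} \
  -H "Content-type: application/json" \
  -d '
{
  "query": [
    {"_name": "listAlert"},
    {"_name": "sort", "_fields": [{"date": "asc"}]},
    {"_name": "page", "from": 0, "to": 15, "withStats": false}
  ]
}'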

@crackytsi I'm surprised by your results. 3 seconds to retrieve the first 15 cases is very long. I think your platform is overloaded.

@martinr103
Author

Thanks @To-om. With the modified command {"_name": "page", "from": 0, "to": 15, "withStats": false}
I am getting this result:

[{...lots of output...}]
real    0m3.986s
user    0m0.003s
sys     0m0.002s

Out of curiosity I also tried one of @crackytsi's searches, and got these numbers:

time curl "http://localhost:9000/api/alert/_search?range=0-15&sort=-date" -s -u XXXXXXXX -H "Content-type: application/json"   -d '
 {"query":{"_and":[{}]}
 }'

[{...lots of output...}]
real    0m7.221s
user    0m0.002s
sys     0m0.004s

On a second attempt, the same query:

real    0m5.288s
user    0m0.001s
sys     0m0.004s

So yeah... it's consistently slow... 4, 5, 7 seconds for the alert search.

@crackytsi
Copy link

crackytsi commented Jun 10, 2020

@To-om: Yes, the platform might be overloaded. But it is actually the same host that runs TH3.4 with 6000 cases at 2.2 GHz without any bigger CPU spikes. For the Cassandra update, I added an additional CPU, but the system is still under heavy load.
I also noticed that the import speed slows down the more cases I import. Do you have a recommendation for this amount of data? What is your HW setup?

time curl http://x.x.x.x:9000/api/v1/query   -s -u user:password   -H "Content-type: application/json"   -d '{ "query": [
    {"_name": "listAlert"},
    {"_name": "sort", "_fields": [{"date": "asc"}]},
    {"_name": "page", "from": 0, "to": 15, "withStats": false}
  ]}'
Before Upgrade:
real    0m0.444s
user    0m0.012s
sys     0m0.004s


I added 2 more CPUs and more memory.
While browsing through the cases, I can't see any CPU limitation (not even on a single core).

top - 10:43:32 up 21 min,  2 users,  load average: 0.96, 0.58, 0.47
Tasks: 195 total,   1 running, 194 sleeping,   0 stopped,   0 zombie
%Cpu0  :  5.9 us,  2.8 sy,  0.0 ni, 91.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  6.8 us,  3.4 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu2  : 11.4 us,  4.6 sy,  0.0 ni, 84.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 10.6 us,  4.4 sy,  0.0 ni, 85.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 19.2 us, 12.5 sy,  0.0 ni, 65.9 id,  0.0 wa,  0.0 hi,  2.4 si,  0.0 st
%Cpu5  : 22.2 us, 12.0 sy,  0.0 ni, 62.7 id,  0.0 wa,  0.0 hi,  3.2 si,  0.0 st
%Cpu6  :  4.9 us,  2.1 sy,  0.0 ni, 92.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  7.0 us,  2.4 sy,  0.0 ni, 90.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  66114116 total, 15606444 used, 50507672 free,   100696 buffers
KiB Swap:  3019772 total,        0 used,  3019772 free.   994896 cached Mem

After the upgrade, loading alerts (note that there are only 16):
real    0m0.380s
user    0m0.012s
sys     0m0.004s

Loading cases (currently 2900, no import job running) after the HW upgrade:
 time curl http://x.x.x.x:9000/api/v1/query   -s -u user:password  -H "Content-type: application/json"   -d '{ "query": [
    {"_name": "listCase"},
    {"_name": "sort", "_fields": [{"startDate": "asc"}]},
    {"_name": "page", "from": 0, "to": 15}
  ]}'
real    0m6.768s
user    0m0.020s
sys     0m0.004s

Nevertheless, browsing through cases is still slow (it takes around 5-10 seconds per page load).

@rriclet
Contributor

rriclet commented Mar 19, 2021

This issue has been fixed with #1731, included in release 4.1.0.
Feel free to re-open this issue if you still notice slow performance.

rriclet closed this as completed Mar 19, 2021