Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backend] modify rss http getter to a simple fetch (#8736) #9006

Closed
wants to merge 2 commits into from

Conversation

JeremyCloarec
Copy link
Contributor

@JeremyCloarec JeremyCloarec commented Nov 14, 2024

Proposed changes

  • instead of using httpClient, use simple fetch to get RSS data

Related issues

Checklist

  • I consider the submitted work as finished
  • I tested the code for its functionality
  • I wrote test cases for the relevant uses case (coverage and e2e)
  • I added/update the relevant documentation (either on github or on notion)
  • Where necessary I refactored code to improve the overall quality

Further comments

In the related issue, all linked feeds are now fetched without any 403 errors.
However, the https://cybersecurity.att.com/site/blog-all-rss feed isn't ingested properly, because items in this feed don't have any pubDate metadata, they only have a dc:date. Not sure if we want to modify the RSS parser to use dc:date if no pubDate exist in the item?

@github-actions github-actions bot added the filigran team use to identify PR from the Filigran team label Nov 14, 2024
Copy link

codecov bot commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 66.25%. Comparing base (96bbd5a) to head (bf2ef75).
Report is 160 commits behind head on master.

Files with missing lines Patch % Lines
...rm/opencti-graphql/src/manager/ingestionManager.ts 33.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9006      +/-   ##
==========================================
- Coverage   66.28%   66.25%   -0.04%     
==========================================
  Files         597      597              
  Lines       61098    61156      +58     
  Branches     6287     6288       +1     
==========================================
+ Hits        40501    40521      +20     
- Misses      20597    20635      +38     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@lndrtrbn lndrtrbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what are the advantages of using our custom httpClient but are we ok with the fact to bypass it?

@JeremyCloarec JeremyCloarec added the do not merge Do not merge this PR until this tag will be removed label Nov 14, 2024
@JeremyCloarec
Copy link
Contributor Author

I don't know what are the advantages of using our custom httpClient but are we ok with the fact to bypass it?

I'm not sure either, but it was the only way I was able to bypass the 403 errors. We talked about it with @romain-filigran, and the plan will be to merge it to master and keep a close eye on wether the previous RSS feeds break following this change. If that is the case, this will need to be reverted

@aHenryJard
Copy link
Member

I think that the opencti httpclient manages at least proxy configuration, have you test your PR behind a proxy ?

Copy link
Member

@aHenryJard aHenryJard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the root cause of the 403, have you found it ? If not maybe it's because we are requesting too much this URL (DDOS protection somehow or bot protection). I don't understand how changing the http client is solving issue can you elaborate please ?

I don't see any usage of proxy and proxy CA so I'm blocking until you give me feedback on the proxy settings.

@JeremyCloarec
Copy link
Contributor Author

I didn't think about proxy settings you're right, this solution doesn't work.
I wasn't able to find the root cause, I suspect a Cloudflare bot protection, but I didn't find any proper way to understand what triggers the 403 rejections. What's weird is that even when sending the exact same request as the browser but with curl, I get a 403 error. When the same request in the browser works properly...
Would you be up for a pair debugging session to dig into it?

@JeremyCloarec JeremyCloarec marked this pull request as draft November 26, 2024 14:09
@SamuelHassine SamuelHassine deleted the issue/8736 branch January 5, 2025 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge Do not merge this PR until this tag will be removed filigran team use to identify PR from the Filigran team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants