Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the default text analyzer to some fields #680

Merged
merged 17 commits into from
Dec 12, 2019
Merged

Conversation

webmat
Copy link
Contributor

@webmat webmat commented Dec 6, 2019

This implements many fields mentioned in #570.

Note: I'm not adding host.name, mentioned in 570 because the default analyzer doesn't split on -. So I'm not sure it's worth adding. Please let me know if I'm missing something.

In going over the fields, I've identified a few more that I think may be interesting to add to this PR. Please voice your opinions. I'm happy to hold off or add now:

  • dns.question.name
  • threat.technique.name
  • vulnerability.description
  • tls.client.issuer, tls.client.subject, tls.server.issuer, tls.server.subject

@neu5ron @randomuserid @rw-access @MikePaquette @peasead

@webmat webmat self-assigned this Dec 6, 2019
@webmat webmat marked this pull request as ready for review December 6, 2019 20:13
@peasead
Copy link
Contributor

peasead commented Dec 6, 2019

@webmat I think that vulnerability.description is a fantastic use case for .text as it would allow someone to search for microsoft instead of microsoft has a vulnerability in... (if I'm understanding the difference between keyword and text properly).

@mbudge
Copy link
Contributor

mbudge commented Dec 7, 2019

Would ngram be better for filepath and process path?

These tend to be longer strings.

Wildcard search against a text field when searching TB's of data might be slow, if a company is collecting logs from a medium to large enterprise network.

"analyzer": { "ngram_analyzer": { "type": "custom", "tokenizer": "ngram_tokenizer", "filter": [ "lowercase" ] } }, "tokenizer": { "ngram_tokenizer": { "type": "ngram", "min_gram": 4, "max_gram": 4 } } },

This is how we index dns.question.name

" name": { "ignore_above": 1024, "normalizer": "lowercase_normalizer", "type": "keyword", "fields": { "ngram": { "type": "text", "analyzer": "ngram_analyzer" } } },

@webmat
Copy link
Contributor Author

webmat commented Dec 9, 2019

@peasead Yes, that's exactly the reason I'm considering adding it there :-)

@mbudge Agreed, there are better ways to index the path-like fields like path, url etc. Still, I think it's good to add the default .text analyzer anyway for a few reasons:

  • it works surprisingly well nonetheless. Words are tokenized on / and the stemming doesn't seem to be causing issues.
  • We will keep working on crafting a great analyzer for these more structured fields, but this will take more time. When that's ready, we will introduce this as a differently named multi-field (not .text), since it will work differently. @neu5ron already provided a great example of a better analyzer for these fields in HELK in further .text and keyword discussion #570.

So overall the thinking for the path fields should be interpreted as "progress over perfection". But we'll still deliver the perfection in time ;-) If the fields specific to paths are superior enough, we could even deprecate their .text and take them out in ECS 2.0 / Elastic Stack 8.0, and keep only the specialized analyzers. We'll see.

@webmat
Copy link
Contributor Author

webmat commented Dec 9, 2019

@mbudge And you also bring another good point on performance. Right now we're adding these in order to enable efficient detections, mostly for the SIEM alerting engine. If users want to remove these analyzers and lose the ability to do these detections, they're free to do that.

@webmat
Copy link
Contributor Author

webmat commented Dec 9, 2019

@dainperkins What would you think about having the default analyzer (full text search) on threat.technique.name? I'm not sure it's worth adding to threat.tactic.name as these are very high level and 1-2 words. But techniques are usually more detailed, and one could very well want to search for threat.technique.name:denial or something. WDYT?

@dainperkins
Copy link
Contributor

I think thats an excellent idea - I'll make a PR if you show me what needs to be done :)

@peasead
Copy link
Contributor

peasead commented Dec 9, 2019

Whoops, thanks for the reminder @dainperkins

@webmat do you need me to make a fresh PR with the changes to vulnerability.description?

@webmat
Copy link
Contributor Author

webmat commented Dec 10, 2019

@peasead @dainperkins Meant to respond earlier, sorry I forgot to hit "Comment" 😂 I added them both to this PR directly.

@peasead
Copy link
Contributor

peasead commented Dec 10, 2019

@peasead @dainperkins Meant to respond earlier, sorry I forgot to hit "Comment" 😂 I added them both to this PR directly.

Roger roger. I'll drop the PR. 👍

@webmat
Copy link
Contributor Author

webmat commented Dec 11, 2019

I consider this PR ready for final review, please voice opinions (esp. disagreement) soon, I'd like to merge tomorrow.

If you think additional fields would benefit from this, please voice your opinion. But this shouldn't be considered a blocker for the PR, they will be addressed in follow-up PRs :-)

@dainperkins
Copy link
Contributor

looks good to me

@webmat webmat requested a review from dainperkins December 11, 2019 18:45
@peasead
Copy link
Contributor

peasead commented Dec 11, 2019

LGTM

Copy link
Contributor

@andrewstucki andrewstucki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, these fields make sense to me. Here are some others that we may want to consider subsequently:

For sure:

package.description

Maybe? (thinking about query patterns):

*.registered_domain
service.name
service.node.name

@andrewstucki andrewstucki mentioned this pull request Dec 11, 2019
@webmat
Copy link
Contributor Author

webmat commented Dec 11, 2019

Agree with some of those. I'd like to do subsequent PRs (in or after 1.4) for them, however.

Quick note on values that are dot separated (domains, hostnames) or even dash-separated (hostnames): the default analyzer doesn't deal well with them. They would require a specially crafted analyzer that breaks them up correctly :-)

@webmat webmat changed the title Add the default text indexer to some fields Add the default text analyzer to some fields Dec 12, 2019
@webmat webmat merged commit 1e2924a into elastic:master Dec 12, 2019
dcode pushed a commit to dcode/ecs that referenced this pull request Apr 15, 2020
…#680)

Note: fields that are reused elsewhere are getting the `text` multi-fields in all locations where they're reused as well.

`text` introduced on these fields:

- as.organization.name
- error.stack_trace
- file.path
- file.target_path
- http.request.body
- http.response.body
- organization.name
- os.name
- os.full
- process.executable
- process.name
- process.title
- process.command_line
- process.working_directory
- threat.technique.name
- url.original
- url.full
- user.name
- user.full
- vulnerability.description
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants