Add the default text analyzer to some fields #680

webmat · 2019-12-06T19:55:52Z

This implements many fields mentioned in #570.

Note: I'm not adding host.name, mentioned in #570, because the default analyzer doesn't split on -. So I'm not sure it's worth adding. Please let me know if I'm missing something.

In going over the fields, I've identified a few more that I think may be interesting to add to this PR. Please voice your opinions. I'm happy to hold off or add now:

dns.question.name
threat.technique.name
vulnerability.description
tls.client.issuer, tls.client.subject, tls.server.issuer, tls.server.subject

@neu5ron @randomuserid @rw-access @MikePaquette @peasead

peasead · 2019-12-06T21:23:29Z

@webmat I think that vulnerability.description is a fantastic use case for .text, as it would allow someone to search for "microsoft" instead of "microsoft has a vulnerability in..." (if I'm understanding the difference between keyword and text properly).
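
The keyword/text difference described above can be sketched in a few lines of Python. This is not Elasticsearch code: `simple_analyze` is a hypothetical stand-in that only lowercases and splits on non-alphanumeric characters, the sample description is made up, and requiring every query term to match is a simplification chosen for the sketch.

```python
import re


def simple_analyze(value: str) -> list[str]:
    """Hypothetical stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


def keyword_match(stored: str, query: str) -> bool:
    """A keyword field only matches when the whole stored value equals the query."""
    return stored == query


def text_match(stored: str, query: str) -> bool:
    """A text field matches when every analyzed query term is among the stored tokens."""
    tokens = set(simple_analyze(stored))
    terms = simple_analyze(query)
    return bool(terms) and all(term in tokens for term in terms)


# Made-up sample value, for illustration only.
description = "Microsoft has a vulnerability in an example component"

print(keyword_match(description, "microsoft"))  # whole-value comparison
print(text_match(description, "microsoft"))     # token-level comparison
```

With the keyword-style comparison, only the full description matches; with the text-style comparison, a single word from it is enough.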

mbudge · 2019-12-07T08:36:33Z

Would ngram be better for filepath and process path?

These tend to be longer strings.

Wildcard searches against a text field might be slow when searching terabytes of data, for example when a company is collecting logs from a medium to large enterprise network.

"analysis": {
  "analyzer": {
    "ngram_analyzer": {
      "type": "custom",
      "tokenizer": "ngram_tokenizer",
      "filter": [ "lowercase" ]
    }
  },
  "tokenizer": {
    "ngram_tokenizer": {
      "type": "ngram",
      "min_gram": 4,
      "max_gram": 4
    }
  }
},

This is how we index dns.question.name:

"name": {
  "ignore_above": 1024,
  "normalizer": "lowercase_normalizer",
  "type": "keyword",
  "fields": {
    "ngram": {
      "type": "text",
      "analyzer": "ngram_analyzer"
    }
  }
},

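To illustrate why fixed-size n-grams can stand in for wildcard searches, here is a minimal Python sketch of what a 4-gram tokenizer followed by a lowercase filter would emit, as in the settings above. The function names `ngrams` and `ngram_contains` are hypothetical, and the lookup is a simplification: checking that every gram is present can produce false positives, because it does not verify that the grams are adjacent.

```python
def ngrams(value: str, n: int = 4) -> list[str]:
    """Emit every lowercase substring of length n, in order of position."""
    lowered = value.lower()
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


def ngram_contains(stored: str, fragment: str, n: int = 4) -> bool:
    """Approximate substring search using gram lookups instead of a wildcard scan.

    Fragments shorter than n produce no grams and therefore never match.
    """
    indexed = set(ngrams(stored, n))
    wanted = ngrams(fragment, n)
    return bool(wanted) and all(gram in indexed for gram in wanted)


print(ngrams("abcde"))
print(ngram_contains("www.example.com", "example"))
```

The trade-off in this approach is index size: each value is stored as many overlapping grams, in exchange for term lookups at query time.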
webmat · 2019-12-09T14:56:28Z

@peasead Yes, that's exactly the reason I'm considering adding it there :-)

@mbudge Agreed, there are better ways to index the path-like fields like path, url etc. Still, I think it's good to add the default .text analyzer anyway for a few reasons:

It works surprisingly well nonetheless. Words are tokenized on / and the stemming doesn't seem to be causing issues.
We will keep working on crafting a great analyzer for these more structured fields, but this will take more time. When that's ready, we will introduce it as a differently named multi-field (not .text), since it will work differently. @neu5ron already provided a great example of a better analyzer for these fields, from HELK, in "further .text and keyword discussion" (#570).

So overall the thinking for the path fields should be interpreted as "progress over perfection". But we'll still deliver the perfection in time ;-) If the fields specific to paths are superior enough, we could even deprecate their .text and take them out in ECS 2.0 / Elastic Stack 8.0, and keep only the specialized analyzers. We'll see.

webmat · 2019-12-09T14:59:17Z

@mbudge And you also bring another good point on performance. Right now we're adding these in order to enable efficient detections, mostly for the SIEM alerting engine. If users want to remove these analyzers and lose the ability to do these detections, they're free to do that.

webmat · 2019-12-09T15:29:46Z

@dainperkins What would you think about having the default analyzer (full text search) on threat.technique.name? I'm not sure it's worth adding to threat.tactic.name as these are very high level and 1-2 words. But techniques are usually more detailed, and one could very well want to search for threat.technique.name:denial or something. WDYT?

dainperkins · 2019-12-09T18:23:15Z

I think that's an excellent idea - I'll make a PR if you show me what needs to be done :)

peasead · 2019-12-09T19:20:18Z

Whoops, thanks for the reminder @dainperkins

@webmat do you need me to make a fresh PR with the changes to vulnerability.description?

webmat · 2019-12-10T21:56:47Z

@peasead @dainperkins Meant to respond earlier, sorry I forgot to hit "Comment" 😂 I added them both to this PR directly.

peasead · 2019-12-10T22:02:10Z

> @peasead @dainperkins Meant to respond earlier, sorry I forgot to hit "Comment" 😂 I added them both to this PR directly.

Roger roger. I'll drop the PR. 👍

webmat · 2019-12-11T18:11:15Z

I consider this PR ready for final review. Please voice opinions (esp. disagreement) soon; I'd like to merge tomorrow.

If you think additional fields would benefit from this, please voice your opinion. But this shouldn't be considered a blocker for the PR, they will be addressed in follow-up PRs :-)

dainperkins · 2019-12-11T18:23:11Z

looks good to me

peasead · 2019-12-11T19:09:01Z

LGTM

andrewstucki

In general, these fields make sense to me. Here are some others that we may want to consider subsequently:

For sure:

package.description

Maybe? (thinking about query patterns):

*.registered_domain
service.name
service.node.name

webmat · 2019-12-11T21:15:19Z

Agree with some of those. I'd like to do subsequent PRs (in or after 1.4) for them, however.

Quick note on values that are dot separated (domains, hostnames) or even dash-separated (hostnames): the default analyzer doesn't deal well with them. They would require a specially crafted analyzer that breaks them up correctly :-)
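
As a rough illustration of what "breaking them up correctly" could mean for such values, here is a small Python sketch. It is not an Elasticsearch analyzer: `split_host` is a hypothetical helper, and splitting on dots and dashes is only one possible choice for a specially crafted analyzer.

```python
import re


def split_host(name: str) -> list[str]:
    """Hypothetical tokenization: lowercase, then split on dots and dashes."""
    return [part for part in re.split(r"[.\-]+", name.lower()) if part]


print(split_host("web-01.example.com"))
```

With tokens like these, a search for a single label of a hostname or domain could match without needing a wildcard.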

…#680)

Note: fields that are reused elsewhere are getting the `text` multi-fields in all locations where they're reused as well.

`text` introduced on these fields:

- as.organization.name
- error.stack_trace
- file.path
- file.target_path
- http.request.body
- http.response.body
- organization.name
- os.name
- os.full
- process.executable
- process.name
- process.title
- process.command_line
- process.working_directory
- threat.technique.name
- url.original
- url.full
- user.name
- user.full
- vulnerability.description
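
For readers unfamiliar with multi-fields, the following Python sketch builds the kind of mapping fragment this change implies: a keyword field with an additional `text` sub-field that uses the default analyzer. The helper `with_text_multifield` is hypothetical, and the `ignore_above` value of 1024 is borrowed from the mapping quoted earlier in the thread rather than taken from this PR's generated output.

```python
import json


def with_text_multifield(field_name: str) -> dict:
    """Build a nested mapping for a dotted field name with a `text` multi-field.

    Assumption: the leaf stays a keyword and gains a `text` sub-field; a `text`
    type without an explicit analyzer falls back to the default analyzer.
    """
    mapping: dict = {
        "type": "keyword",
        "ignore_above": 1024,  # assumed value, taken from the example earlier in the thread
        "fields": {"text": {"type": "text"}},
    }
    # Wrap the leaf in one "properties" object per path segment, innermost first.
    for part in reversed(field_name.split(".")):
        mapping = {"properties": {part: mapping}}
    return mapping


print(json.dumps(with_text_multifield("process.name"), indent=2))
```

The sub-field is then addressed by appending its name to the parent, for example `process.name.text`, while `process.name` keeps its exact-match keyword behaviour.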

Mathieu Martin added 12 commits December 6, 2019 13:54

as.text

9d5cd7f

process.text

8b3cc58

url.text

2b5871d

user.text

a60988d

error.text

79fb33e

file.text

77643f9

http.text

19b3aec

os.yml

5d6896e

organization.text

d3a15c3

Fix schema_reader to loop over list of mf entries, during deep copy

da8bc9b

Try bullets for mf in asciidoc

0db474d

make

fcf23ce

webmat self-assigned this Dec 6, 2019

webmat marked this pull request as ready for review December 6, 2019 20:13

This was referenced Dec 6, 2019

Split user-agent original into text and keyword #555

Closed

Case sensitivity for keywords #623

Closed

peasead mentioned this pull request Dec 10, 2019

updated description to text datatype #690

Closed

Mathieu Martin added 3 commits December 10, 2019 16:53

Space. The final frontier.

23d3027

Add .text to threat.technique.name and vulnerability.description

3126cd4

Re-generate

b060aa4

Merge branch 'master' into moar-text

ca6d08a

webmat requested a review from dainperkins December 11, 2019 18:45

andrewstucki approved these changes Dec 11, 2019

andrewstucki mentioned this pull request Dec 11, 2019

added rule fields #665

Merged

Changelog

5ee11d5

webmat changed the title ~~Add the default text indexer to some fields~~ Add the default text analyzer to some fields Dec 12, 2019

webmat merged commit 1e2924a into elastic:master Dec 12, 2019
