Protocol names and http method shall be lowercased #253
Thanks!
This will adversely impact users who rely on known lists of values. In the case of network.transport, the capitalization is static because it comes from a public reference that is typically used (IANA). It will be annoying for users to have to convert every list they use for lookups to lowercase.
The other case in which this becomes problematic is Logstash codecs, which may be case sensitive. These codecs are already written and would require substantial effort to convert if case-dependent values are widespread in them.
Overall, I think this should not be forced in the schema. Instead, correlation and aggregation should rely on a case-insensitive search.
As @webmat mentioned in chat -
A multi-field for `text` datatype on http method would let you do case insensitive search, but not case sensitive aggregations
So this may be a limitation of ES that we're forcing on users to resolve because they can't search and aggregate case insensitive terms.
@robgil makes a fair point. Question: Is there any way the normalizer can help us solve this problem for ECS? If not, then I'll vote to go forward with this change. If ECS is going to impose a burden, I'd rather see the extra burden placed on the ETL logic so that we can preserve all the capabilities of the Elastic Stack on the data once ingested and indexed.
@MikePaquette
Thanks for your input, @robgil. You make a valid point that when doing things right, one (or one's tool) should use a canonical list to get predictable names, if such a list is available.
What we're trying to achieve here, however, is to make sure that whatever source we get this information from, it's trivial to normalize the values. This instruction: "lowercase it" is much simpler to follow than "follow the IANA naming scheme". The former can even be done with
In terms of adapting tooling, whatever decision we make in implementing ECS risks forcing something to change somehow. We started ECS because things were scattered in all directions with regard to naming, type, etc. So making these decisions will in some cases force annoying changes in some places. In this case, though, reusable pipelines make it pretty easy to "patch this up": create an ingest pipeline that does this lowercase normalization in all the expected places, with minimal disruption.
Of course, ideally you may want to take advantage of the normalization earlier and end up normalizing as early as possible in your processing, but that can be done over time. I don't think there's a rush to fix it all in time for 7.0 necessarily, as long as we offer an easy enough way to help people get it done somehow.
As for your point about being able to perform aggregations in a case-insensitive manner: I don't think this is possible. Full text search indexing with datatype
I'll do a bit more research on the matter, though.
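As an illustration of the "patch this up" idea above, here is a minimal sketch of an ingest pipeline body that lowercases a fixed set of fields using Elasticsearch's `lowercase` ingest processor. The field list, the helper name `build_lowercase_pipeline`, and the choice of `http.request.method` as the HTTP method field are assumptions made for the example, not something this PR defines.

```python
import json

# Hypothetical field list for illustration. `network.transport` is mentioned
# in this thread; `http.request.method` is an assumed field name.
FIELDS_TO_LOWERCASE = ["http.request.method", "network.transport"]


def build_lowercase_pipeline(fields):
    """Return an ingest pipeline body with one `lowercase` processor per field."""
    return {
        "description": "Lowercase selected fields at ingest time",
        "processors": [
            # ignore_missing lets events without the field pass through untouched
            {"lowercase": {"field": name, "ignore_missing": True}}
            for name in fields
        ],
    }


pipeline = build_lowercase_pipeline(FIELDS_TO_LOWERCASE)
print(json.dumps(pipeline, indent=2))
```

A body like this would be registered through the ingest pipeline API and then referenced by the indexing clients that need it, so existing shippers would not have to change how they emit values.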
Oh jeez, hadn't seen the two most recent comments. I'll check out
OK, I've checked it out, and this will not solve the normalization issue. The search on this case-insensitive field will return both a "get" and a "GET" with the query
Try it out:
Aggregation results:
Jeez wait, I misread my aggregation results (was looking at the hits sample). The aggregation results actually confirm that the result is lowercased!
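For readers following along, here is a rough sketch of what a lowercase normalizer on a `keyword` field can look like, together with a small local simulation of its effect. This is not the snippet that was tested in this thread. The normalizer name `lowercase_normalizer`, the field path `http.request.method`, and the typeless mapping layout are assumptions, and `simulate_index` is a hypothetical helper that only mimics the behavior in plain Python.

```python
# Index body sketch: a custom normalizer that applies the built-in
# `lowercase` token filter, attached to a keyword field.
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "http": {
                "properties": {
                    "request": {
                        "properties": {
                            "method": {
                                "type": "keyword",
                                "normalizer": "lowercase_normalizer",
                            }
                        }
                    }
                }
            }
        }
    },
}


def simulate_index(value):
    """Mimic the normalizer locally: the stored source keeps its original
    capitalization, while the indexed term (used for search and
    aggregations) is lowercased."""
    return {"_source": value, "indexed_term": value.lower()}


for method in ["GET", "get", "Post"]:
    print(simulate_index(method))
```

The point of the simulation is the split between the two values: hits still show the original capitalization, while the terms that aggregations bucket on are lowercased.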
The best practice with aggregations is to use
No ambiguity:
@ruflin Let's discuss using the
To keep this moving I suggest we change
For the normalizer: I think we should provide tooling around this at some stage, but it is not a blocker for 1.0. Also, this tooling should always be opt-in and not required.
I've been thinking about this. We can't solely rely on a normalizer in Elasticsearch. We're defining a schema, and even offering a reference ES template. But that doesn't mean people will use it as is. We've been saying that the spec is the readme because there are issues with the template being generated. So the readme must mention this, but could offer some leeway and options.
Note: I try to use the RFC 2119 definitions of SHOULD and MUST as follows:
You can keep capitalization, as long as indexing is lowercase at the very least
Force-pushed from c874583 to 09b6da0
@MikePaquette I've updated the descriptions with a longer blurb than this, after all. Please check it out.
I have kept the word "must", but I mention an option that lets you keep your capitalization in the source while ensuring that it's lowercased for querying. That is the goal, after all.
If we say this is optional, then solutions will have to assume that some sources don't respect this, and it's just the same as not enforcing capitalization at all. And we're back to searching for
However the proposed option to leave the source capitalized how you want, but using the
Add a section explaining the acceptable implementations of the 'lowercase' normalization in the 'Implementing ECS' section of the readme.
I'm still in favor of using "should" instead of must because of the argument that @MikePaquette made above.
Approving, as in the end I'm OK with both options since `_source` stays intact.
I too still think SHOULD is better than MUST, but I am approving because we can always "relax" the requirement later in documentation if we choose to.
Just like #245 (http method), this will ensure aggregations don't have duplicate
entries for events with different capitalization (IPv4 vs IPV4 vs ipv4).
Closes #251
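
The duplicate-bucket problem described above can be reproduced locally with a small sketch. The sample values are made up for illustration, and `collections.Counter` stands in for a terms aggregation.

```python
from collections import Counter

# Made-up sample of protocol names as they might arrive from different sources.
values = ["IPv4", "IPV4", "ipv4", "IPv6", "ipv6"]

# Without normalization, each capitalization gets its own bucket.
raw_buckets = Counter(values)

# With lowercasing, variants of the same protocol collapse into one bucket.
normalized_buckets = Counter(value.lower() for value in values)

print(len(raw_buckets), "buckets before:", dict(raw_buckets))
print(len(normalized_buckets), "buckets after:", dict(normalized_buckets))
```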