Protocol names and http method shall be lowercased #253

webmat · 2018-12-07T14:48:57Z

Just like #245 (http method), this will ensure aggregations don't have duplicate
entries for events with different capitalization (IPv4 vs IPV4 vs ipv4).

Closes #251

andrewkroh

Thanks!

robgil

This will adversely impact users who use known lists of values. In the case of network.transport, the capitalization is static as its a public reference that is typically used (IANA). It will be annoying to the user to have to take every list they use for lookups and convert to lower case.

The other case in which this becomes problematic is with logstash codecs which may be case sensitive. These codecs are already written and would require substantial effort to convert if the use of case is widespread in the codecs.

Overall, I think this should not be forced in the schema, but require a case insensitive search to do the correlation and aggregation.

As @webmat mentioned in chat -

A multi-field for `text` datatype on http method would let you do case insensitive search, but not case sensitive aggregations

So this may be a limitation of ES that we're forcing on users to resolve because they can't search and aggregate case insensitive terms.

MikePaquette · 2018-12-07T20:40:05Z

@robgil makes a fair point.

Question: Is there any way the normalizer can help us solve this problem for ECS?

If not, then I'll vote to go forward with this change. If ECS is going to impose a burden, I'd rather see the extra burden placed on the ETL logic so that we can preserve all the capabilities of the Elastic Stack on the data once ingested and indexed.

robgil · 2018-12-07T20:58:18Z

@MikePaquette normalized may indeed solve the issue! I guess the question now is which fields should be "normalized" and applying it to the template.

webmat · 2018-12-07T21:30:19Z

Thanks for your input, @robgil.

You make a valid point that when doing things right, one (or one's tool) should use a canonical list to get predictable names, if such a list is available.

What we're trying to achieve here however, is to make sure that whatever source we get this information from, it's trivial to normalize the values. This instruction: "lowercase it" is much simpler to follow than "follow the IANA naming scheme". The former can even be done with tr in a Bash script.

In terms of adapting tooling, whatever decision we make with implementing ECS risks forcing something to change somehow. We've started ECS because things were scattered in all directions, with regards to naming, type, etc. So making these decisions will in some cases force annoying change in some places.

Although in this case, with reusable pipelines, it can be pretty easy to "patch this up" by creating an ingest pipeline that does this lowercase normalization in all the expected places with minimal disruption. Of course ideally you may want to take advantage of the normalization earlier, and end up normalizing as early as possible in your processing, but that can be done over time. I don't think there's a rush to fix it all in time for 7.0 necessarily. As long as we offer an easy enough way to help people get it done somehow.

For your point about being able to perform aggregations in a case insensitive manner. I don't think this is possible. Full text search indexing with datatype text will give it to you because part of the flexibility of this approach is to deconstruct each word enough that you can find all of the equivalencies despite the query details (diffs in gender, plural, capitalization, word stem, etc). The story is different for keyword indexing. The approach there is extremely fast indexing, storage efficiency and being able to perform aggregations precisely because it's an exact match strategy. I'm not sure we could get case insensitive queries or aggregations on keyword fields...

I'll do a bit more research on the matter, though.

webmat · 2018-12-07T21:34:14Z

Oh jeez, hadn't seen the two most recent comments. I'll check out normalizer in detail, I agree this looks promising :-)

webmat · 2018-12-07T21:45:39Z

Ok, I've checked it out, and this will not solve the normalization issue.

The search on this case-insensitive field will return both a "get" and a "GET" with the query my_field:get. However when aggregating on the field, the raw values ("get" and "GET") are still reported separately.

Try it out:

# Create index and index some example data
PUT normalized_idx
{ "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
    }}}},
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "keyword",
          "normalizer": "my_normalizer"
}}}}}
PUT normalized_idx/_doc/1
{ "my_field": "get" }
PUT normalized_idx/_doc/2
{ "my_field": "GET" }
PUT normalized_idx/_doc/3
{ "my_field": "MADNESS" }
# Admire your work
GET normalized_idx/_search

# Case insensitive search returns both get and GET. This is what normalizer does.
GET normalized_idx/_search?q=my_field:get

# Aggregation still lists them separately, because it's based on the _source,
# which ES indexing options never modifies
GET normalized_idx/_search
{ "aggs" : {
    "methods" : {
      "terms" : { "field" : "my_field" }
    }
  }
}

Aggregation results:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "normalized_idx",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "my_field" : "GET"
        }
      },
      {
        "_index" : "normalized_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "my_field" : "get"
        }
      },
      {
        "_index" : "normalized_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "my_field" : "MADNESS"
        }
      }
    ]
  },
  "aggregations" : {
    "methods" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "get",
          "doc_count" : 2
        },
        {
          "key" : "madness",
          "doc_count" : 1
        }
      ]
    }
  }
}

webmat · 2018-12-07T21:48:32Z

Jeez wait, I misread my aggregation results (was looking at the hits sample).

The aggregation result actually confirm that the result is lowercased!

webmat · 2018-12-07T21:49:52Z

The best practice with aggregations is to use size:0 to get only the aggregation. I should have done that :-)

GET normalized_idx/_search
{ "size": 0,
  "aggs" : {
    "methods" : {
      "terms" : { "field" : "my_field" }
    }
  }
}

No ambiguity:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "methods" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "get",
          "doc_count" : 2
        },
        {
          "key" : "madness",
          "doc_count" : 1
        }
      ]
    }
  }
}

webmat · 2018-12-07T21:51:05Z

@ruflin Let's discuss using the lowercase normalizer on Monday. I think this could work.

ruflin · 2018-12-10T12:05:17Z

To keep this moving I suggest we change must to should.

For the normalizer: I think we should provide at one stage tooling around this but not a blocker for 1.0. Also this tooling should always be opt-in and not required.

webmat · 2018-12-10T13:57:35Z

I've been thinking about this. We can't solely rely on a normalizer in Elasticsearch. We're defining a schema, and even offering a reference ES template. But that doesn't mean people will use it as is. We've been saying that the spec is the readme because there are issues with the template being generated.

So the readme must mention this, but could offer some leeway and options.

It should be normalized to lowercase. You can do so prior to ingestion (for example with Ingest Node) or you can use the lowercase normalizer in your indices.

MikePaquette · 2018-12-10T15:04:41Z

@webmat I am +1 with proceeding with this PR.
I like your suggestion of adding:

You can do so prior to ingestion (for example with Ingest Node) or you can use the lowercase normalizer in your indices.

The "SHOULD" language seems better than the "MUST" language.

Note: I try to use the rfc2119 definitions of SHOULD and MUST as follows:

USE MUST when a violation of the rule would result in inability to use ECS indexed data, such as type mismatches, failed queries, etc.
USE SHOULD when a violation of the rule might result in suboptimal, or unexpected results being successfully returned from a query, etc.

You can keep capitalization, as long as indexing is lowercase at the very least

webmat · 2018-12-10T15:44:57Z

@MikePaquette I've updated the descriptions with a longer blurb than this, after all. Please check it out.

I have kept the word "must" but I mention an option that let you keep your capitalization in the source, but ensures that it's lowercased for querying. That is the goal, after all.

If we say this is optional, then solutions will have to assume that some sources don't respect this, and it's just the same as not enforcing capitalization at all. And we're back to searching for get and not getting all the "get"s, or aggregating, and seeing the same value appear multiple times capitalized differently.

However the proposed option to leave the source capitalized how you want, but using the lowercase filter in your normalizer ensures we are able to query / aggregate in a predictable fashion, while the events still contain the original capitalization.

Add a section explaining the acceptable implementations of the 'lowercase' normalization in the 'Implementing ECS' section of the readme.

ruflin

I'm still in favor of using "should" instead of must because of the argument that @MikePaquette made above.

ruflin

Approving as in the end I'm ok with both options as _source stays intact.

MikePaquette

I too still think SHOULD is better than MUST, but I am approving because we can always "relax" the requirement later in documentation if we choose to.

webmat requested review from ruflin, andrewkroh and robgil December 7, 2018 14:49

webmat self-assigned this Dec 7, 2018

webmat added the 1.0.0-beta2 label Dec 7, 2018

webmat mentioned this pull request Dec 7, 2018

http.request.method must be lowercase. #245

Closed

ruflin approved these changes Dec 7, 2018

View reviewed changes

andrewkroh approved these changes Dec 7, 2018

View reviewed changes

robgil suggested changes Dec 7, 2018

View reviewed changes

webmat mentioned this pull request Dec 10, 2018

Apply lowercase normalizer to certain keyword fields in template #260

Open

Mathieu Martin added 2 commits December 10, 2018 09:41

Protocol names shall be lowercased

d339a89

Changelog

bc05ddc

Update the definition to mention options.

09b6da0

You can keep capitalization, as long as indexing is lowercase at the very least

webmat force-pushed the lowercase-protocols branch from c874583 to 09b6da0 Compare December 10, 2018 15:38

Mathieu Martin added 2 commits December 10, 2018 13:54

Make the lowercase definitions simple again.

7f63d1f

Add a section explaining the acceptable implementations of the 'lowercase' normalization in the 'Implementing ECS' section of the readme.

Move the lowercase requirement of http method to this PR.

be3bef3

webmat changed the title ~~Protocol names shall be lowercased~~ Protocol names and http method shall be lowercased Dec 10, 2018

ruflin reviewed Dec 11, 2018

View reviewed changes

ruflin approved these changes Dec 11, 2018

View reviewed changes

MikePaquette approved these changes Dec 11, 2018

View reviewed changes

webmat merged commit f187461 into elastic:master Dec 11, 2018

webmat deleted the lowercase-protocols branch December 11, 2018 13:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Protocol names and http method shall be lowercased #253

Protocol names and http method shall be lowercased #253

webmat commented Dec 7, 2018

andrewkroh left a comment

robgil left a comment

MikePaquette commented Dec 7, 2018

robgil commented Dec 7, 2018

webmat commented Dec 7, 2018

webmat commented Dec 7, 2018

webmat commented Dec 7, 2018 •

edited

Loading

webmat commented Dec 7, 2018 •

edited

Loading

webmat commented Dec 7, 2018

webmat commented Dec 7, 2018

ruflin commented Dec 10, 2018

webmat commented Dec 10, 2018

MikePaquette commented Dec 10, 2018

webmat commented Dec 10, 2018 •

edited

Loading

ruflin left a comment

ruflin left a comment

MikePaquette left a comment

Protocol names and http method shall be lowercased #253

Protocol names and http method shall be lowercased #253

Conversation

webmat commented Dec 7, 2018

andrewkroh left a comment

Choose a reason for hiding this comment

robgil left a comment

Choose a reason for hiding this comment

MikePaquette commented Dec 7, 2018

robgil commented Dec 7, 2018

webmat commented Dec 7, 2018

webmat commented Dec 7, 2018

webmat commented Dec 7, 2018 • edited Loading

webmat commented Dec 7, 2018 • edited Loading

webmat commented Dec 7, 2018

webmat commented Dec 7, 2018

ruflin commented Dec 10, 2018

webmat commented Dec 10, 2018

MikePaquette commented Dec 10, 2018

webmat commented Dec 10, 2018 • edited Loading

ruflin left a comment

Choose a reason for hiding this comment

ruflin left a comment

Choose a reason for hiding this comment

MikePaquette left a comment

Choose a reason for hiding this comment

webmat commented Dec 7, 2018 •

edited

Loading

webmat commented Dec 7, 2018 •

edited

Loading

webmat commented Dec 10, 2018 •

edited

Loading