Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protocol names and http method shall be lowercased #253

Merged
merged 5 commits into from
Dec 11, 2018

Conversation

webmat
Copy link
Contributor

@webmat webmat commented Dec 7, 2018

Just like #245 (http method), this will ensure aggregations don't have duplicate
entries for events with different capitalization (IPv4 vs IPV4 vs ipv4).

Closes #251

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@robgil robgil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will adversely impact users who use known lists of values. In the case of network.transport, the capitalization is static as its a public reference that is typically used (IANA). It will be annoying to the user to have to take every list they use for lookups and convert to lower case.

The other case in which this becomes problematic is with logstash codecs which may be case sensitive. These codecs are already written and would require substantial effort to convert if the use of case is widespread in the codecs.

Overall, I think this should not be forced in the schema, but require a case insensitive search to do the correlation and aggregation.

As @webmat mentioned in chat -

A multi-field for `text` datatype on http method would let you do case insensitive search, but not case sensitive aggregations

So this may be a limitation of ES that we're forcing on users to resolve because they can't search and aggregate case insensitive terms.

@MikePaquette
Copy link
Contributor

@robgil makes a fair point.

Question: Is there any way the normalizer can help us solve this problem for ECS?

If not, then I'll vote to go forward with this change. If ECS is going to impose a burden, I'd rather see the extra burden placed on the ETL logic so that we can preserve all the capabilities of the Elastic Stack on the data once ingested and indexed.

@robgil
Copy link
Contributor

robgil commented Dec 7, 2018

@MikePaquette normalized may indeed solve the issue! I guess the question now is which fields should be "normalized" and applying it to the template.

@webmat
Copy link
Contributor Author

webmat commented Dec 7, 2018

Thanks for your input, @robgil.

You make a valid point that when doing things right, one (or one's tool) should use a canonical list to get predictable names, if such a list is available.

What we're trying to achieve here however, is to make sure that whatever source we get this information from, it's trivial to normalize the values. This instruction: "lowercase it" is much simpler to follow than "follow the IANA naming scheme". The former can even be done with tr in a Bash script.

In terms of adapting tooling, whatever decision we make with implementing ECS risks forcing something to change somehow. We've started ECS because things were scattered in all directions, with regards to naming, type, etc. So making these decisions will in some cases force annoying change in some places.

Although in this case, with reusable pipelines, it can be pretty easy to "patch this up" by creating an ingest pipeline that does this lowercase normalization in all the expected places with minimal disruption. Of course ideally you may want to take advantage of the normalization earlier, and end up normalizing as early as possible in your processing, but that can be done over time. I don't think there's a rush to fix it all in time for 7.0 necessarily. As long as we offer an easy enough way to help people get it done somehow.

For your point about being able to perform aggregations in a case insensitive manner. I don't think this is possible. Full text search indexing with datatype text will give it to you because part of the flexibility of this approach is to deconstruct each word enough that you can find all of the equivalencies despite the query details (diffs in gender, plural, capitalization, word stem, etc). The story is different for keyword indexing. The approach there is extremely fast indexing, storage efficiency and being able to perform aggregations precisely because it's an exact match strategy. I'm not sure we could get case insensitive queries or aggregations on keyword fields...

I'll do a bit more research on the matter, though.

@webmat
Copy link
Contributor Author

webmat commented Dec 7, 2018

Oh jeez, hadn't seen the two most recent comments. I'll check out normalizer in detail, I agree this looks promising :-)

@webmat
Copy link
Contributor Author

webmat commented Dec 7, 2018

Ok, I've checked it out, and this will not solve the normalization issue.

The search on this case-insensitive field will return both a "get" and a "GET" with the query my_field:get. However when aggregating on the field, the raw values ("get" and "GET") are still reported separately.

Try it out:

# Create index and index some example data
PUT normalized_idx
{ "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
    }}}},
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "keyword",
          "normalizer": "my_normalizer"
}}}}}
PUT normalized_idx/_doc/1
{ "my_field": "get" }
PUT normalized_idx/_doc/2
{ "my_field": "GET" }
PUT normalized_idx/_doc/3
{ "my_field": "MADNESS" }
# Admire your work
GET normalized_idx/_search

# Case insensitive search returns both get and GET. This is what normalizer does.
GET normalized_idx/_search?q=my_field:get

# Aggregation still lists them separately, because it's based on the _source,
# which ES indexing options never modifies
GET normalized_idx/_search
{ "aggs" : {
    "methods" : {
      "terms" : { "field" : "my_field" }
    }
  }
}

Aggregation results:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "normalized_idx",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "my_field" : "GET"
        }
      },
      {
        "_index" : "normalized_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "my_field" : "get"
        }
      },
      {
        "_index" : "normalized_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "my_field" : "MADNESS"
        }
      }
    ]
  },
  "aggregations" : {
    "methods" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "get",
          "doc_count" : 2
        },
        {
          "key" : "madness",
          "doc_count" : 1
        }
      ]
    }
  }
}

@webmat
Copy link
Contributor Author

webmat commented Dec 7, 2018

Jeez wait, I misread my aggregation results (was looking at the hits sample).

The aggregation result actually confirm that the result is lowercased!

@webmat
Copy link
Contributor Author

webmat commented Dec 7, 2018

The best practice with aggregations is to use size:0 to get only the aggregation. I should have done that :-)

GET normalized_idx/_search
{ "size": 0,
  "aggs" : {
    "methods" : {
      "terms" : { "field" : "my_field" }
    }
  }
}

No ambiguity:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "methods" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "get",
          "doc_count" : 2
        },
        {
          "key" : "madness",
          "doc_count" : 1
        }
      ]
    }
  }
}

@webmat
Copy link
Contributor Author

webmat commented Dec 7, 2018

@ruflin Let's discuss using the lowercase normalizer on Monday. I think this could work.

@ruflin
Copy link
Contributor

ruflin commented Dec 10, 2018

To keep this moving I suggest we change must to should.

For the normalizer: I think we should provide at one stage tooling around this but not a blocker for 1.0. Also this tooling should always be opt-in and not required.

@webmat
Copy link
Contributor Author

webmat commented Dec 10, 2018

I've been thinking about this. We can't solely rely on a normalizer in Elasticsearch. We're defining a schema, and even offering a reference ES template. But that doesn't mean people will use it as is. We've been saying that the spec is the readme because there are issues with the template being generated.

So the readme must mention this, but could offer some leeway and options.

It should be normalized to lowercase. You can do so prior to ingestion (for example with Ingest Node) or you can use the lowercase normalizer in your indices.

@MikePaquette
Copy link
Contributor

  • @webmat I am +1 with proceeding with this PR.
  • I like your suggestion of adding:

You can do so prior to ingestion (for example with Ingest Node) or you can use the lowercase normalizer in your indices.

  • The "SHOULD" language seems better than the "MUST" language.

Note: I try to use the rfc2119 definitions of SHOULD and MUST as follows:

  • USE MUST when a violation of the rule would result in inability to use ECS indexed data, such as type mismatches, failed queries, etc.
  • USE SHOULD when a violation of the rule might result in suboptimal, or unexpected results being successfully returned from a query, etc.

You can keep capitalization, as long as indexing is lowercase at the very least
@webmat webmat force-pushed the lowercase-protocols branch from c874583 to 09b6da0 Compare December 10, 2018 15:38
@webmat
Copy link
Contributor Author

webmat commented Dec 10, 2018

@MikePaquette I've updated the descriptions with a longer blurb than this, after all. Please check it out.

I have kept the word "must" but I mention an option that let you keep your capitalization in the source, but ensures that it's lowercased for querying. That is the goal, after all.

If we say this is optional, then solutions will have to assume that some sources don't respect this, and it's just the same as not enforcing capitalization at all. And we're back to searching for get and not getting all the "get"s, or aggregating, and seeing the same value appear multiple times capitalized differently.

However the proposed option to leave the source capitalized how you want, but using the lowercase filter in your normalizer ensures we are able to query / aggregate in a predictable fashion, while the events still contain the original capitalization.

Mathieu Martin added 2 commits December 10, 2018 13:54
Add a section explaining the acceptable implementations of the 'lowercase' normalization in the 'Implementing ECS' section of the readme.
@webmat webmat changed the title Protocol names shall be lowercased Protocol names and http method shall be lowercased Dec 10, 2018
Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still in favor of using "should" instead of must because of the argument that @MikePaquette made above.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving as in the end I'm ok with both options as _source stays intact.

Copy link
Contributor

@MikePaquette MikePaquette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too still think SHOULD is better than MUST, but I am approving because we can always "relax" the requirement later in documentation if we choose to.

@webmat webmat merged commit f187461 into elastic:master Dec 11, 2018
@webmat webmat deleted the lowercase-protocols branch December 11, 2018 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants