Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate outputs for each individual subset with custom options #873

Merged
merged 13 commits into from
Jul 13, 2020

Conversation

marshallmain
Copy link
Contributor

@marshallmain marshallmain commented Jun 18, 2020

The main purpose of this PR is to produce intermediate generated files where the fields can be extended with custom options in subset files. As a user, I can then write a script to read the intermediate files and generate other necessary files from them. To achieve this, some (small) breaking changes to the subset format are needed.

This PR makes some changes and improvements to the subset format. While there are breaking changes, updating subsets to match the changes is easy. It would probably also be good to mark the subset feature as experimental.

  • BREAKING: Remove support for missing fields keys being treated as equivalent to fields: '*'. Instead, fields is required as a key for any field that has sub-fields in the schema (object or nested), and fields is required NOT to be a key for fields that don't have sub-fields. This makes it easier to support custom options on individual fields since we don't have cases where fields may or may not be an option. IMO this is also a more intuitive format since leaf fields will no longer have the awkward fields: '*' option and objects will clearly show if they are including all subfields.
  • BREAKING: adds name and fields as top level fields in subsets, rather than having the top level schema fields be the top level keys. The name is used as the directory to write individual intermediate generated files to. fields is added to keep the fields (what used to be the entire subset file) separate from the name.
  • When the --subset option is used each subset will produce intermediate generated files individually before the subsets are merged to produce the complete intermediate files. When creating the individual versions, any custom options on subset fields are added to the schema and therefore show up in the generated files. This is useful for annotating fields that have certain properties outside of elasticsearch - for example, we want to keep track of which fields are valid for users to key on when creating rule exceptions so we'll add a custom option exceptionable: true to those fields in the subset, and that will show up in the generated file as well.
  • The complete intermediate file will not contain any custom fields because we don't know what semantics to use when merging if a custom option is present on a field in both subsets.

Example field, annotated with exceptionable: true:

event.action:
  dashed_name: event-action
  description: 'The action captured by the event.

    This describes the information in the event. It is more specific than `event.category`.
    Examples are `group-add`, `process-started`, `file-created`. The value is normally
    defined by the implementer.'
  example: user-password-change
  exceptionable: true
  flat_name: event.action
  ignore_above: 1024
  level: core
  name: action
  normalize: []
  short: The action captured by the event.
  type: keyword

Subset that would create this field:

---
name: malware_event
fields:
  base:
    fields:
      "@timestamp": {}
  agent:
    fields: "*"
  dll:
    fields: "*"
  ecs:
    fields: "*"
  event:
    fields:
      action:
        exceptionable: true

@ebeahan
Copy link
Member

ebeahan commented Jun 24, 2020

Thanks @marshallmain! I like the top-level additions of name and field and removing the leaf fields containing the fields: "*" option. A couple of comments:

  • For directory organization it may be easier to place all individual subset directories into one common subset directory (e.g. <out>/generated/subset/<name_one>/, <out>/generated/subset/<name_two>/, etc.).

  • Once a custom option is defined on a field in a subset, the generated output only contains the defined field(s). I don't believe this behavior was introduced with this PR - just an observation.

For example declaring the following in the subset:

---
name: host_event
fields:
  host:
    fields:
      name:
        exceptionable: true

Filters all host fields except for host.name in the generated artifacts.

Imagine a scenario with only one sub-field using a custom option but all other sub-fields required. Would an option to include all other sub-fields without explicitly defining be useful?

@@ -50,29 +49,56 @@ def warn(message):
print(message)


ecs_options = ['fields', 'enabled', 'index']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So would idea here be that we could set enabled and index per subset if we wanted instead of having to do it globally in the custom schema files for example like we are doing here: https://github.com/elastic/endpoint-package/blob/master/custom_schemas/custom_endpoint.yml#L28

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and when merging subsets together the field is enabled in the index if it is enabled in any of the subsets. This also lets us easily disable indexing on ECS fields.

@jonathan-buttner
Copy link
Contributor

Thanks for putting this PR up @marshallmain this will definitely be helpful for endpoint.

BREAKING: Remove support for missing fields keys being treated as equivalent to fields: ''. Instead, fields is required as a key for any field that has sub-fields in the schema (object or nested), and fields is required NOT to be a key for fields that don't have sub-fields. This makes it easier to support custom options on individual fields since we don't have cases where fields may or may not be an option. IMO this is also a more intuitive format since leaf fields will no longer have the awkward fields: '' option and objects will clearly show if they are including all subfields.

Could you talk about fields: '*' vs the use of {}? I just want to make sure I understand when we'd use one over the other going forward. I see in your example we have it for "@timestamp": {} so would we only use for leaf fields?

Thanks @marshallmain! I like the top-level additions of name and field and removing the leaf fields containing the fields: "*" option. A couple of comments:

  • For directory organization it may be easier to place all individual subset directories into one common subset directory (e.g. <out>/generated/subset/<name_one>/, <out>/generated/subset/<name_two>/, etc.).

Having it go to a subset directory sounds like a good idea too

  • Once a custom option is defined on a field in a subset, the generated output only contains the defined field(s). I don't believe this behavior was introduced with this PR - just an observation.

For example declaring the following in the subset:

---
name: host_event
fields:
  host:
    fields:
      name:
        exceptionable: true

Filters all host fields except for host.name in the generated artifacts.

Imagine a scenario with only one sub-field using a custom option but all other sub-fields required. Would an option to include all other sub-fields without explicitly defining be useful?

hmm yeah that might be useful, or maybe we could have a flag define the behavior?

@marshallmain
Copy link
Contributor Author

@jonathan-buttner yeah {} would only be used for leaf fields and would leave all default options. 'fields': '*' would be the way to include all sub-fields with default options. I see the * notation as a convenience that makes it quick to get started, but I think for our endpoint use case we want the fine grained control over each field so we'll be moving away from using *.

It would be nice to be able to include all other sub fields, I haven't tried to implement it yet though. I find that the subsets I make fall into 2 distinct categories. Starting out I use fields: '*' heavily just to get something that works done quickly, and I'm not worrying about the options on individual fields much. Once the frame of the subset is done and I'm defining options on individual fields then I find it useful to have all the sub fields listed explicitly so I don't have to flip back and forth between the subset and the schema files, even though the subset might have a more compact representation using an "include all other fields with defaults also" option.

Copy link
Member

@ebeahan ebeahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I noted a couple of minor import-related follow-ups.

@@ -2,6 +2,8 @@
import glob
import os
import yaml
import copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these import statements end up unnecessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this one was unnecessary, I moved the copying from generator.py to subset_filter.py

@@ -1,28 +1,26 @@
import glob
import yaml
import copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here - did this import end up unneeded?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I'm not seeing any usage of the copy library here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, missed this. Will remove.

Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this, Marshall!

Yeah I'm good with breaking changes on --subset. This is so experimental that it's still not documented, other than comments in #746 ;-) And while the --subset feature is great already, I agree it can still be improved substantially.

I love how this sets the stage to automate the maintenance of many custom templates at once.

I'd like to question part of the philosophy of the approach, however. If I understand correctly, you're using the subset file to add attributes to the fields? I'd like to avoid doing that, as I think this will be confusing. I'd rather keep "subset" as strictly a filtering mechanism to determine which fields are included and excluded in the output.

Since the #864 rewrite, the --include mechanism no longer validates custom YAML files prior to merging. Original ECS fields + the included custom fields are all read & merged, prior to doing any validation. I redesigned it this way to allow modifying existing ECS fields without having to duplicate mandatory field attributes like before. In other words, if you want to add exceptionable: true to an existing ECS field, you should be able to do this with a very minimal custom file, e.g.:

- name: event
  fields:
    - name: action
      exceptionable: true

In case the current implementation breaks on unknown attributes, it's fine to adapt it so this works.

I'd like to react to 2 comments I've seen here.

The complete intermediate file will not contain any custom fields

By this do you mean "any custom attributes"? (e.g. like exceptionable). I'd be open to the intermediate files containing custom attributes, as a matter of fact. This will not affect the "official" ECS files, so I don't think this introduce a burden on ECS. If this flexibility is useful when generating custom artifacts, I'm all for it.

Filters all host fields except for host.name in the generated artifacts

@ebeahan I'm not 100% sure what exceptionable is, but I don't think it's about this process of filtering which fields are output in the artifacts. My understanding is that exceptionable is an actual field attribute that ultimately makes sense in the final Beats field definition YAML file.

A few additional thoughts:

  • Instead of fieldname: {} for leaf fields, how about we support this? fields: [field1, field2]. That would be much less wordy.
  • Subset currently acts as an allow list. But within an allowed grouping of fields, I'd like to be able to remove sections as well.
    • Example: I want all of the log.* fields except log.syslog.*

Comment on lines 47 to 53
for subset in subsets:
subfields = subset_filter.extract_matching_fields(fields, subset['fields'])
intermediate_files.generate(subfields, os.path.join(out_dir, 'ecs', 'subset', subset['name']), default_dirs)

merged_subset = subset_filter.combine_all_subsets(subsets)
if merged_subset:
fields = subset_filter.extract_matching_fields(fields, merged_subset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep all subset-related functionality like this loop inside subset_filter.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll refactor this

es_template.generate(flat, ecs_version, out_dir, args.template_settings, args.mapping_settings)
beats.generate(nested, ecs_version, out_dir)
if args.include or args.subset:
exit()

csv_generator.generate(flat, ecs_version, out_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the CSV before the customization's early exit.

I'd much rather have a flag that lets the user pick which artifact they want generated (could be done here, or as another PR).

Examples:

  • Current behaviour when there's customizations: --artifacts csv,beats,intermediate
  • What the Endpoint team wants: --artifacts beats,intermediate

WDYT?

Note: if you look into adding this flag to the PR, you can keep things simple and continue ignoring asciidoc for now. In customization cases, I don't want to have to support generating customized docs.

Copy link
Contributor Author

@marshallmain marshallmain Jul 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this because certain subsets can break the csv generation and cause the tooling to crash (if the subset sets enabled: false on an object that isn't explicitly listed, like process.thread, then that field is no longer only an intermediate field but it doesn't have all the required fields for csv generation). For now I can add the necessary attributes for csv generation to a field when a subset changes the field from intermediate: true to intermediate: false, and then move the csv generation back to where it was.

We don't have an immediate need for a flag to restrict the artifacts that are generated, our Makefile removes the ones we don't need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the concept of an --artifacts and think it'd be a good addition to incorporate eventually. Opened #885 to capture

@@ -195,6 +195,8 @@ def merge_fields(a, b):
asd['reusable']['top_level'] = bsd['reusable']['top_level']
else:
asd['reusable'].setdefault('top_level', True)
if 'order' in bsd['reusable']:
asd['reusable']['order'] = bsd['reusable']['order']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Endpoint customizations have chained reuses like group => user => other places?

In any case 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we reuse process at Target.process and since hash is reused in process we need a way to either set the order for hash to 1 or the order for process to 3

@@ -1,28 +1,26 @@
import glob
import yaml
import copy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I'm not seeing any usage of the copy library here.

@marshallmain
Copy link
Contributor Author

Thanks Mat! I initially tried adding the custom attributes to the schema files themselves, but it's useful to have different values per subset for a custom attribute on a particular field. For example, our events types coming from the endpoint include process among others. For process events, we want to set an attribute like include: required on most of the process.* fields. However, for a different type of event (e.g. file or network) we want to use the same process fields from the schema but set the attribute as include: optional since they would likely be applicable to some but not all events of that type.

In this approach the schema files define the full repository of available fields and their types, which ensures that all documents that use the schemas use consistent field names and types. The subset files then pick specific fields from the schema and apply attributes that don't need to be consistent between all documents that use the schemas.

By this do you mean "any custom attributes"? (e.g. like exceptionable). I'd be open to the intermediate files containing custom attributes, as a matter of fact. This will not affect the "official" ECS files, so I don't think this introduce a burden on ECS. If this flexibility is useful when generating custom artifacts, I'm all for it.

Yeah, I meant custom attributes there. I avoided including custom attributes in the final ECS intermediate files because the "official" files are generated after merging subsets together, and if 2 subsets specified different values for the same custom attribute we wouldn't know how to merge the attributes. Additionally, in our use case we've found it useful to have the intermediate files for each individual subset even before custom attributes were introduced - so automatically generating all of the intermediate files for each subset and dropping the custom attributes in satisfies both needs simultaneously.

@ebeahan I'm not 100% sure what exceptionable is, but I don't think it's about this process of filtering which fields are output in the artifacts. My understanding is that exceptionable is an actual field attribute that ultimately makes sense in the final Beats field definition YAML file.

Yeah exceptionable is just an example of a custom attribute - it's the attribute we are adding to our endpoint subsets. I think the example he gave was demonstrating how in the current subset notation you can't specify some subfields with custom attributes and then say "and include all other subfields with default options". The choices are include all subfields with default options (fields: '*') or explicitly list out all the subfields you want, which can be a bit cumbersome.

Improvements like "and include all other subfields with default options" would fit in with your last 2 bullet points regarding allowing fields: [field1, field2] and the ability to subset by excluding fields rather than including them. I've personally felt the pain of listing out almost all subfields with default options so I definitely agree those would be great improvements in the future.

@ebeahan
Copy link
Member

ebeahan commented Jul 7, 2020

Yeah exceptionable is just an example of a custom attribute - it's the attribute we are adding to our endpoint subsets. I think the example he gave was demonstrating how in the current subset notation you can't specify some subfields with custom attributes and then say "and include all other subfields with default options". The choices are include all subfields with default options (fields: '*') or explicitly list out all the subfields you want, which can be a bit cumbersome.

Yes - thanks @marshallmain for clarifying 😄

@ebeahan
Copy link
Member

ebeahan commented Jul 10, 2020

If I understand correctly, you're using the subset file to add attributes to the fields? I'd like to avoid doing that, as I think this will be confusing. I'd rather keep "subset" as strictly a filtering mechanism to determine which fields are included and excluded in the output.

I agree that limiting --subset to filtering would be less confusing. However, I do like the simplicity of the single file to edit which defines the subset filtering and attribute customizations, and since --subset supports multiple subset files to be passed in, the functionality prevents the generator from having to be called multiple times for different subset outputs.

I think we move forward with introducing the functionality under --subset, but we can consider how to better future support of customizations/filtering across all the generator features. Perhaps something like a single configuration file, where the subset would an option strictly for filtering?

Instead of fieldname: {} for leaf fields, how about we support this? fields: [field1, field2]. That would be much less wordy.

Also agree this would be a more precise syntax. Not a blocker here but a good future enhancement.

ebeahan
ebeahan previously approved these changes Jul 10, 2020
@marshallmain
Copy link
Contributor Author

I don't have the ability to merge anymore - @ebeahan if this LGTM can you hit the button?

@ebeahan
Copy link
Member

ebeahan commented Jul 13, 2020

@marshallmain one item I overlooked earlier - can you add an entry to CHANGELOG.next.md?

Otherwise LGTM, and we'll get this merged.

@ebeahan ebeahan merged commit d18afed into elastic:master Jul 13, 2020
ebeahan added a commit to ebeahan/ecs that referenced this pull request Jul 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants