Generate outputs for each individual subset with custom options #873

marshallmain · 2020-06-18T20:50:25Z

The main purpose of this PR is to produce intermediate generated files where the fields can be extended with custom options in subset files. As a user, I can then write a script to read the intermediate files and generate other necessary files from them. To achieve this, some (small) breaking changes to the subset format are needed.

This PR makes some changes and improvements to the subset format. While there are breaking changes, updating subsets to match the changes is easy. It would probably also be good to mark the subset feature as experimental.

- BREAKING: Remove support for missing fields keys being treated as equivalent to fields: '*'. Instead, fields is required as a key for any field that has sub-fields in the schema (object or nested), and fields is required NOT to be a key for fields that don't have sub-fields. This makes it easier to support custom options on individual fields since we don't have cases where fields may or may not be an option. IMO this is also a more intuitive format since leaf fields will no longer have the awkward fields: '*' option and objects will clearly show if they are including all subfields.
- BREAKING: Add name and fields as top-level keys in subsets, rather than having the top-level schema fields be the top-level keys. The name is used as the directory to write the individual intermediate generated files to. fields is added to keep the fields (what used to be the entire subset file) separate from the name.
- When the --subset option is used, each subset produces its own intermediate generated files before the subsets are merged to produce the complete intermediate files. When creating the individual versions, any custom options on subset fields are added to the schema and therefore show up in the generated files. This is useful for annotating fields that have certain properties outside of Elasticsearch. For example, we want to keep track of which fields are valid for users to key on when creating rule exceptions, so we'll add a custom option exceptionable: true to those fields in the subset, and that will show up in the generated file as well.
- The complete intermediate file will not contain any custom fields because we don't know what semantics to use when merging if a custom option is present on a field in both subsets.

Example field, annotated with exceptionable: true:

event.action:
  dashed_name: event-action
  description: 'The action captured by the event.

    This describes the information in the event. It is more specific than `event.category`.
    Examples are `group-add`, `process-started`, `file-created`. The value is normally
    defined by the implementer.'
  example: user-password-change
  exceptionable: true
  flat_name: event.action
  ignore_above: 1024
  level: core
  name: action
  normalize: []
  short: The action captured by the event.
  type: keyword

Subset that would create this field:

---
name: malware_event
fields:
  base:
    fields:
      "@timestamp": {}
  agent:
    fields: "*"
  dll:
    fields: "*"
  ecs:
    fields: "*"
  event:
    fields:
      action:
        exceptionable: true
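
The workflow described above (reading a subset's intermediate file and picking out the annotated fields) might look like the following minimal sketch. It assumes the flat intermediate file has already been parsed into a dict keyed by flat field name. The function name `fields_with_option` and the sample data are illustrative and not part of the ECS tooling.

```python
def fields_with_option(flat_fields, option, value=True):
    """Return the sorted flat names of fields whose `option` equals `value`."""
    return sorted(
        name for name, details in flat_fields.items()
        if details.get(option) == value
    )


# Stand-in for the parsed contents of one subset's flat intermediate file.
# In practice this dict would come from loading the generated YAML.
flat_fields = {
    'event.action': {'type': 'keyword', 'level': 'core', 'exceptionable': True},
    '@timestamp': {'type': 'date', 'level': 'core'},
    'agent.id': {'type': 'keyword', 'level': 'core'},
}

exceptionable = fields_with_option(flat_fields, 'exceptionable')
print(exceptionable)
```

A downstream script could write the resulting list to whatever format the consumer of the rule-exception metadata needs.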

ebeahan · 2020-06-24T18:35:50Z

Thanks @marshallmain! I like the top-level additions of name and fields and removing the leaf fields containing the fields: "*" option. A couple of comments:

For directory organization it may be easier to place all individual subset directories into one common subset directory (e.g. <out>/generated/subset/<name_one>/, <out>/generated/subset/<name_two>/, etc.).
Once a custom option is defined on a field in a subset, the generated output only contains the defined field(s). I don't believe this behavior was introduced with this PR - just an observation.

For example declaring the following in the subset:

---
name: host_event
fields:
  host:
    fields:
      name:
        exceptionable: true

Filters all host fields except for host.name in the generated artifacts.

Imagine a scenario with only one sub-field using a custom option but all other sub-fields required. Would an option to include all other sub-fields without explicitly defining be useful?

jonathan-buttner · 2020-06-25T13:33:14Z

scripts/schema/subset_filter.py

@@ -50,29 +49,56 @@ def warn(message):
    print(message)


+ecs_options = ['fields', 'enabled', 'index']


So would the idea here be that we could set enabled and index per subset if we wanted, instead of having to do it globally in the custom schema files? For example, like we are doing here: https://github.com/elastic/endpoint-package/blob/master/custom_schemas/custom_endpoint.yml#L28

Yeah, and when merging subsets together the field is enabled in the index if it is enabled in any of the subsets. This also lets us easily disable indexing on ECS fields.
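
The merge rule described in this reply can be sketched as follows. This is a simplified illustration and not the code in subset_filter.py: the function name `merge_subsets` and the sample subsets are assumptions, and a missing `enabled` or `index` key is assumed to mean true.

```python
import copy


def merge_subsets(a, b):
    """Return a new subset dict combining `a` and `b` (both left unchanged)."""
    merged = copy.deepcopy(a)
    for key, b_node in b.items():
        if key not in merged:
            merged[key] = copy.deepcopy(b_node)
            continue
        a_node = merged[key]
        a_fields, b_fields = a_node.get('fields'), b_node.get('fields')
        if a_fields == '*' or b_fields == '*':
            # '*' already includes every sub-field, so it wins.
            a_node['fields'] = '*'
        elif isinstance(a_fields, dict) and isinstance(b_fields, dict):
            a_node['fields'] = merge_subsets(a_fields, b_fields)
        # A flag stays False only if both subsets set it to False;
        # a missing flag counts as True.
        for flag in ('enabled', 'index'):
            if a_node.get(flag, True) or b_node.get(flag, True):
                a_node.pop(flag, None)
    return merged


# Hypothetical subsets: one disables process.thread, the other does not.
process_events = {'process': {'fields': {'thread': {'enabled': False, 'fields': '*'}, 'pid': {}}}}
file_events = {'process': {'fields': {'thread': {'fields': '*'}, 'name': {'index': False}}}}

merged = merge_subsets(process_events, file_events)
print(merged)
```

Because only one of the two subsets disables process.thread, the merged subset leaves it enabled, while process.name keeps index: False since no other subset mentions it.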

jonathan-buttner · 2020-06-25T13:39:23Z

Thanks for putting this PR up @marshallmain this will definitely be helpful for endpoint.

BREAKING: Remove support for missing fields keys being treated as equivalent to fields: '*'. Instead, fields is required as a key for any field that has sub-fields in the schema (object or nested), and fields is required NOT to be a key for fields that don't have sub-fields. This makes it easier to support custom options on individual fields since we don't have cases where fields may or may not be an option. IMO this is also a more intuitive format since leaf fields will no longer have the awkward fields: '*' option and objects will clearly show if they are including all subfields.

Could you talk about fields: '*' vs the use of {}? I just want to make sure I understand when we'd use one over the other going forward. I see in your example we have it for "@timestamp": {} so would we only use for leaf fields?

Thanks @marshallmain! I like the top-level additions of name and fields and removing the leaf fields containing the fields: "*" option. A couple of comments:

For directory organization it may be easier to place all individual subset directories into one common subset directory (e.g. <out>/generated/subset/<name_one>/, <out>/generated/subset/<name_two>/, etc.).

Having it go to a subset directory sounds like a good idea too

Once a custom option is defined on a field in a subset, the generated output only contains the defined field(s). I don't believe this behavior was introduced with this PR - just an observation.

For example declaring the following in the subset:
---
name: host_event
fields:
  host:
    fields:
      name:
        exceptionable: true
Filters all host fields except for host.name in the generated artifacts.

Imagine a scenario with only one sub-field using a custom option but all other sub-fields required. Would an option to include all other sub-fields without explicitly defining be useful?

hmm yeah that might be useful, or maybe we could have a flag define the behavior?

marshallmain · 2020-06-25T15:06:59Z

@jonathan-buttner yeah {} would only be used for leaf fields and would leave all default options. 'fields': '*' would be the way to include all sub-fields with default options. I see the * notation as a convenience that makes it quick to get started, but I think for our endpoint use case we want the fine grained control over each field so we'll be moving away from using *.

It would be nice to be able to include all other sub fields, I haven't tried to implement it yet though. I find that the subsets I make fall into 2 distinct categories. Starting out I use fields: '*' heavily just to get something that works done quickly, and I'm not worrying about the options on individual fields much. Once the frame of the subset is done and I'm defining options on individual fields then I find it useful to have all the sub fields listed explicitly so I don't have to flip back and forth between the subset and the schema files, even though the subset might have a more compact representation using an "include all other fields with defaults also" option.
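
To make the difference between {} and fields: '*' concrete, here is a minimal sketch of how a subset could be applied to a schema under the new format. It uses a deliberately simplified schema model (nested dicts whose object nodes carry a `fields` key) and not the real intermediate structure, and `extract_fields` is a hypothetical name.

```python
import copy


def extract_fields(schema, subset):
    """Keep only the schema nodes named in `subset`, applying any options."""
    result = {}
    for name, options in subset.items():
        node = schema[name]
        has_children = 'fields' in node
        # New format: 'fields' is required on objects and forbidden on leaves.
        if has_children != ('fields' in options):
            raise ValueError("'fields' must be set only on fields with sub-fields: " + name)
        kept = {k: copy.deepcopy(v) for k, v in node.items() if k != 'fields'}
        # Any option other than 'fields' is copied onto the field as-is.
        kept.update({k: v for k, v in options.items() if k != 'fields'})
        if has_children:
            if options['fields'] == '*':
                # Include every sub-field with its default options.
                kept['fields'] = copy.deepcopy(node['fields'])
            else:
                kept['fields'] = extract_fields(node['fields'], options['fields'])
        result[name] = kept
    return result


schema = {
    'event': {'fields': {'action': {'type': 'keyword'}, 'kind': {'type': 'keyword'}}},
    'agent': {'fields': {'id': {'type': 'keyword'}}},
}
subset = {
    'agent': {'fields': '*'},
    'event': {'fields': {'action': {'exceptionable': True}}},
}

filtered = extract_fields(schema, subset)
print(filtered)
```

Here agent keeps all of its sub-fields through '*', while event keeps only action, which also picks up the custom exceptionable option. Writing `action: {}` instead would keep the leaf with its defaults.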

ebeahan

Looks good! I noted a couple of minor import-related follow-ups.

ebeahan · 2020-06-29T14:35:40Z

scripts/generator.py

@@ -2,6 +2,8 @@
 import glob
 import os
 import yaml
+import copy


Did these import statements end up unnecessary?

Yeah this one was unnecessary, I moved the copying from generator.py to subset_filter.py

ebeahan · 2020-06-29T14:37:10Z

scripts/schema/subset_filter.py

@@ -1,28 +1,26 @@
 import glob
 import yaml
+import copy


Also here - did this import end up unneeded?

Agree, I'm not seeing any usage of the copy library here.

Oops, missed this. Will remove.

webmat

Thanks for opening this, Marshall!

Yeah I'm good with breaking changes on --subset. This is so experimental that it's still not documented, other than comments in #746 ;-) And while the --subset feature is great already, I agree it can still be improved substantially.

I love how this sets the stage to automate the maintenance of many custom templates at once.

I'd like to question part of the philosophy of the approach, however. If I understand correctly, you're using the subset file to add attributes to the fields? I'd like to avoid doing that, as I think this will be confusing. I'd rather keep "subset" as strictly a filtering mechanism to determine which fields are included and excluded in the output.

Since the #864 rewrite, the --include mechanism no longer validates custom YAML files prior to merging. Original ECS fields + the included custom fields are all read & merged, prior to doing any validation. I redesigned it this way to allow modifying existing ECS fields without having to duplicate mandatory field attributes like before. In other words, if you want to add exceptionable: true to an existing ECS field, you should be able to do this with a very minimal custom file, e.g.:

- name: event
  fields:
    - name: action
      exceptionable: true

In case the current implementation breaks on unknown attributes, it's fine to adapt it so this works.

I'd like to react to 2 comments I've seen here.

The complete intermediate file will not contain any custom fields

By this do you mean "any custom attributes"? (e.g. like exceptionable). I'd be open to the intermediate files containing custom attributes, as a matter of fact. This will not affect the "official" ECS files, so I don't think this introduces a burden on ECS. If this flexibility is useful when generating custom artifacts, I'm all for it.

Filters all host fields except for host.name in the generated artifacts

@ebeahan I'm not 100% sure what exceptionable is, but I don't think it's about this process of filtering which fields are output in the artifacts. My understanding is that exceptionable is an actual field attribute that ultimately makes sense in the final Beats field definition YAML file.

A few additional thoughts:

Instead of fieldname: {} for leaf fields, how about we support this? fields: [field1, field2]. That would be much less wordy.
Subset currently acts as an allow list. But within an allowed grouping of fields, I'd like to be able to remove sections as well.
- Example: I want all of the log.* fields except log.syslog.*
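
The list shorthand proposed above is not implemented in this PR. As a rough sketch of how it could be supported, a pre-processing step might expand `fields: [field1, field2]` into the existing dict form before any filtering happens. The function name `expand_shorthand` is hypothetical.

```python
def expand_shorthand(subset):
    """Rewrite `fields: [a, b]` into the dict form `fields: {a: {}, b: {}}`."""
    expanded = {}
    for name, options in subset.items():
        options = dict(options)
        fields = options.get('fields')
        if isinstance(fields, list):
            # Each listed leaf gets an empty option dict, i.e. all defaults.
            options['fields'] = {leaf: {} for leaf in fields}
        elif isinstance(fields, dict):
            options['fields'] = expand_shorthand(fields)
        # A '*' value or a missing 'fields' key passes through untouched.
        expanded[name] = options
    return expanded


compact = {
    'base': {'fields': ['@timestamp', 'message']},
    'agent': {'fields': '*'},
}
print(expand_shorthand(compact))
```

Expanding up front would let the rest of the subset handling keep working on a single representation.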

webmat · 2020-07-03T17:46:27Z

scripts/generator.py

+    for subset in subsets:
+        subfields = subset_filter.extract_matching_fields(fields, subset['fields'])
+        intermediate_files.generate(subfields, os.path.join(out_dir, 'ecs', 'subset', subset['name']), default_dirs)
+
+    merged_subset = subset_filter.combine_all_subsets(subsets)
+    if merged_subset:
+        fields = subset_filter.extract_matching_fields(fields, merged_subset)


Please keep all subset-related functionality like this loop inside subset_filter.py.

I'll refactor this

webmat · 2020-07-03T17:51:14Z

scripts/generator.py

    es_template.generate(flat, ecs_version, out_dir, args.template_settings, args.mapping_settings)
    beats.generate(nested, ecs_version, out_dir)
    if args.include or args.subset:
        exit()

+    csv_generator.generate(flat, ecs_version, out_dir)


Please keep the CSV before the customization's early exit.

I'd much rather have a flag that lets the user pick which artifact they want generated (could be done here, or as another PR).

Examples:

Current behaviour when there's customizations: --artifacts csv,beats,intermediate

What the Endpoint team wants: --artifacts beats,intermediate

WDYT?

Note: if you look into adding this flag to the PR, you can keep things simple and continue ignoring asciidoc for now. In customization cases, I don't want to have to support generating customized docs.

I moved this because certain subsets can break the csv generation and cause the tooling to crash (if the subset sets enabled: false on an object that isn't explicitly listed, like process.thread, then that field is no longer only an intermediate field but it doesn't have all the required fields for csv generation). For now I can add the necessary attributes for csv generation to a field when a subset changes the field from intermediate: true to intermediate: false, and then move the csv generation back to where it was.

We don't have an immediate need for a flag to restrict the artifacts that are generated, our Makefile removes the ones we don't need.

I like the concept of an --artifacts flag and think it'd be a good addition to incorporate eventually. Opened #885 to capture it.
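
For reference, a flag along the lines discussed in this thread could be parsed as in the sketch below. The --artifacts flag does not exist in generator.py as of this PR; the artifact names and the helpers `artifact_list` and `build_parser` are assumptions made for illustration.

```python
import argparse

# Assumed artifact names, taken from the examples in this thread.
KNOWN_ARTIFACTS = ('beats', 'csv', 'intermediate')


def artifact_list(value):
    """Parse a comma-separated artifact list, rejecting unknown names."""
    names = [name.strip() for name in value.split(',') if name.strip()]
    unknown = sorted(set(names) - set(KNOWN_ARTIFACTS))
    if unknown:
        raise argparse.ArgumentTypeError('unknown artifact(s): ' + ', '.join(unknown))
    return names


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--artifacts', type=artifact_list,
                        default=list(KNOWN_ARTIFACTS),
                        help='comma-separated list of artifacts to generate')
    return parser


args = build_parser().parse_args(['--artifacts', 'beats,intermediate'])
print(args.artifacts)
```

Each generator call would then be guarded by a membership check such as `if 'csv' in args.artifacts:` in place of the early exit.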

webmat · 2020-07-03T17:55:00Z

scripts/schema/loader.py

@@ -195,6 +195,8 @@ def merge_fields(a, b):
                    asd['reusable']['top_level'] = bsd['reusable']['top_level']
                else:
                    asd['reusable'].setdefault('top_level', True)
+                if 'order' in bsd['reusable']:
+                    asd['reusable']['order'] = bsd['reusable']['order']


Endpoint customizations have chained reuses like group => user => other places?

In any case 👍

Yeah, we reuse process at Target.process and since hash is reused in process we need a way to either set the order for hash to 1 or the order for process to 3
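
The ordering problem described here can be illustrated with a small sketch. It uses a simplified model of reusable field sets and not the loader's real data structures. The `perform_reuse` helper and the default order of 2 are assumptions.

```python
import copy

DEFAULT_ORDER = 2  # assumed default when a schema does not set reusable.order


def perform_reuse(schemas):
    """Copy each reusable field set into its targets, lowest order first."""
    result = copy.deepcopy(schemas)
    reusable = [name for name, schema in result.items() if 'reusable' in schema]
    # sort() is stable, so equal orders keep their original relative position.
    reusable.sort(key=lambda name: result[name]['reusable'].get('order', DEFAULT_ORDER))
    for name in reusable:
        for target in result[name]['reusable']['expected']:
            result[target]['fields'][name] = copy.deepcopy(result[name]['fields'])
    return result


schemas = {
    'process': {'fields': {'pid': {}}, 'reusable': {'expected': ['Target']}},
    'hash': {'fields': {'sha256': {}}, 'reusable': {'expected': ['process'], 'order': 1}},
    'Target': {'fields': {}},
}

nested = perform_reuse(schemas)
print(nested['Target']['fields']['process'])
```

With hash at order 1, it is copied into process before process is copied to Target.process, so the chained reuse carries the hash fields along. Without the override, process would be copied first in this example and Target.process would miss them.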

webmat · 2020-07-03T17:56:29Z

scripts/schema/subset_filter.py

@@ -1,28 +1,26 @@
 import glob
 import yaml
+import copy


Agree, I'm not seeing any usage of the copy library here.

marshallmain · 2020-07-03T20:45:04Z

Thanks Mat! I initially tried adding the custom attributes to the schema files themselves, but it's useful to have different values per subset for a custom attribute on a particular field. For example, our event types coming from the endpoint include process among others. For process events, we want to set an attribute like include: required on most of the process.* fields. However, for a different type of event (e.g. file or network) we want to use the same process fields from the schema but set the attribute as include: optional since they would likely be applicable to some but not all events of that type.

In this approach the schema files define the full repository of available fields and their types, which ensures that all documents that use the schemas use consistent field names and types. The subset files then pick specific fields from the schema and apply attributes that don't need to be consistent between all documents that use the schemas.

By this do you mean "any custom attributes"? (e.g. like exceptionable). I'd be open to the intermediate files containing custom attributes, as a matter of fact. This will not affect the "official" ECS files, so I don't think this introduces a burden on ECS. If this flexibility is useful when generating custom artifacts, I'm all for it.

Yeah, I meant custom attributes there. I avoided including custom attributes in the final ECS intermediate files because the "official" files are generated after merging subsets together, and if 2 subsets specified different values for the same custom attribute we wouldn't know how to merge the attributes. Additionally, in our use case we've found it useful to have the intermediate files for each individual subset even before custom attributes were introduced - so automatically generating all of the intermediate files for each subset and dropping the custom attributes in satisfies both needs simultaneously.
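
One way to picture the "drop custom attributes before merging" step is the sketch below. It relies on the option list from the subset_filter.py diff earlier in this thread (fields, enabled, index). The function name `strip_custom_options` is hypothetical and the actual implementation may differ.

```python
# Options that carry merge semantics; everything else is a custom attribute.
ECS_OPTIONS = ('fields', 'enabled', 'index')


def strip_custom_options(subset):
    """Return a copy of `subset` that keeps only the recognised options."""
    stripped = {}
    for name, options in subset.items():
        kept = {k: v for k, v in options.items() if k in ECS_OPTIONS}
        if isinstance(kept.get('fields'), dict):
            kept['fields'] = strip_custom_options(kept['fields'])
        stripped[name] = kept
    return stripped


process_subset = {
    'event': {'fields': {'action': {'exceptionable': True, 'index': False}}},
    'process': {'fields': {'pid': {'include': 'required'}}},
}
print(strip_custom_options(process_subset))
```

After this step two subsets can no longer disagree on a custom attribute, so merging only has to deal with fields, enabled, and index.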

@ebeahan I'm not 100% sure what exceptionable is, but I don't think it's about this process of filtering which fields are output in the artifacts. My understanding is that exceptionable is an actual field attribute that ultimately makes sense in the final Beats field definition YAML file.

Yeah exceptionable is just an example of a custom attribute - it's the attribute we are adding to our endpoint subsets. I think the example he gave was demonstrating how in the current subset notation you can't specify some subfields with custom attributes and then say "and include all other subfields with default options". The choices are include all subfields with default options (fields: '*') or explicitly list out all the subfields you want, which can be a bit cumbersome.

Improvements like "and include all other subfields with default options" would fit in with your last 2 bullet points regarding allowing fields: [field1, field2] and the ability to subset by excluding fields rather than including them. I've personally felt the pain of listing out almost all subfields with default options so I definitely agree those would be great improvements in the future.

ebeahan · 2020-07-07T16:34:57Z

Yeah exceptionable is just an example of a custom attribute - it's the attribute we are adding to our endpoint subsets. I think the example he gave was demonstrating how in the current subset notation you can't specify some subfields with custom attributes and then say "and include all other subfields with default options". The choices are include all subfields with default options (fields: '*') or explicitly list out all the subfields you want, which can be a bit cumbersome.

Yes - thanks @marshallmain for clarifying 😄

ebeahan · 2020-07-10T22:55:53Z

If I understand correctly, you're using the subset file to add attributes to the fields? I'd like to avoid doing that, as I think this will be confusing. I'd rather keep "subset" as strictly a filtering mechanism to determine which fields are included and excluded in the output.

I agree that limiting --subset to filtering would be less confusing. However, I do like the simplicity of the single file to edit which defines the subset filtering and attribute customizations, and since --subset supports multiple subset files to be passed in, the functionality prevents the generator from having to be called multiple times for different subset outputs.

I think we move forward with introducing the functionality under --subset, but we can consider how to better support customizations/filtering across all the generator features in the future. Perhaps something like a single configuration file, where the subset would be an option strictly for filtering?

Instead of fieldname: {} for leaf fields, how about we support this? fields: [field1, field2]. That would be much less wordy.

Also agree this would be a more precise syntax. Not a blocker here but a good future enhancement.

marshallmain · 2020-07-13T18:25:30Z

I don't have the ability to merge anymore - @ebeahan if this LGTM can you hit the button?

ebeahan · 2020-07-13T19:49:11Z

@marshallmain one item I overlooked earlier - can you add an entry to CHANGELOG.next.md?

Otherwise LGTM, and we'll get this merged.

marshallmain added 5 commits June 18, 2020 14:07

generate outputs for each individual subset

e5a770c

linting

1fc7e5b

Merge branch 'master' into subset-format-update

d4b2d1b

add enabled and index as official subset options

516bff9

Add tests for subset field custom options

55bb510

allow reusable order to be overridden by custom schemas

83703dc

marshallmain mentioned this pull request Jun 25, 2020

Add fields for file unquarantine message elastic/endpoint-package#16

Merged

jonathan-buttner reviewed Jun 25, 2020

Move subset generated files

b26e028

webmat added the review label Jun 25, 2020

marshallmain mentioned this pull request Jun 25, 2020

Update subsets to conform with new format elastic/endpoint-package#18

Merged

marshallmain requested review from webmat and ebeahan June 26, 2020 16:32

ebeahan reviewed Jun 29, 2020

jonathan-buttner previously approved these changes Jun 29, 2020

Properly mark intermediate fields when custom options are added

b11cb0b

marshallmain dismissed jonathan-buttner’s stale review via b11cb0b July 1, 2020 19:08

ebeahan mentioned this pull request Jul 2, 2020

Document the usage of the ECS generator #746

Closed

webmat reviewed Jul 3, 2020

marshallmain added 2 commits July 9, 2020 12:47

Address review comments

e078458

Leave existing descriptions

2dd496c

This was referenced Jul 9, 2020

Usage improvements #884

Merged

Add --artifacts option for generator.py #885

Closed

marshallmain added 2 commits July 9, 2020 16:27

Fix test

2bd97c9

Merge branch 'master' of github.com:elastic/ecs into subset-format-up…

89c079f

…date

ebeahan previously approved these changes Jul 10, 2020

Update changelog

d4ccd32

marshallmain dismissed ebeahan’s stale review via d4ccd32 July 13, 2020 20:13

ebeahan approved these changes Jul 13, 2020

ebeahan merged commit d18afed into elastic:master Jul 13, 2020

ebeahan added a commit to ebeahan/ecs that referenced this pull request Jul 22, 2020

update subset with improvements from elastic#873

ed1a16d

ebeahan mentioned this pull request Jul 22, 2020

Usage documentation improvements #893

Merged
