Generate outputs for each individual subset with custom options #873
Conversation
Thanks @marshallmain! I like the top-level additions of name and fields.
For example, declaring the following in the subset:
---
name: host_event
fields:
  host:
    fields:
      name:
        exceptionable: true
Filters all host fields except for host.name in the generated artifacts. Imagine a scenario with only one sub-field using a custom option but all other sub-fields required. Would an option to include all other sub-fields without explicitly defining them be useful?
@@ -50,29 +49,56 @@ def warn(message):
    print(message)


ecs_options = ['fields', 'enabled', 'index']
So would the idea here be that we could set enabled and index per subset if we wanted, instead of having to do it globally in the custom schema files, for example like we are doing here: https://github.com/elastic/endpoint-package/blob/master/custom_schemas/custom_endpoint.yml#L28
Yeah, and when merging subsets together the field is enabled in the index if it is enabled in any of the subsets. This also lets us easily disable indexing on ECS fields.
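A minimal sketch of the merge rule described above, assuming per-subset options are plain dicts; the function name and input shape are assumptions for illustration, not the PR's actual code:

# Sketch only: illustrates "enabled in any subset => enabled in the merged output".
def merge_field_options(per_subset_options):
    merged = {}
    for options in per_subset_options:
        for key in ('enabled', 'index'):
            if key in options:
                # True wins across subsets; a field stays disabled only if every
                # subset that mentions it disables it.
                merged[key] = merged.get(key, False) or options[key]
    return merged

print(merge_field_options([{'enabled': False}, {'enabled': True, 'index': False}]))
# -> {'enabled': True, 'index': False}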
Thanks for putting this PR up @marshallmain, this will definitely be helpful for endpoint.
Could you talk about
Having it go to a
Hmm, yeah that might be useful, or maybe we could have a flag to define the behavior?
@jonathan-buttner yeah, it would be nice to be able to include all other sub-fields, I haven't tried to implement it yet though. I find that the subsets I make fall into 2 distinct categories. Starting out I use
Looks good! I noted a couple of minor import-related follow-ups.
scripts/generator.py
Outdated
@@ -2,6 +2,8 @@
import glob
import os
import yaml
import copy
Did these import statements end up unnecessary?
Yeah this one was unnecessary, I moved the copying from generator.py to subset_filter.py
scripts/schema/subset_filter.py
Outdated
@@ -1,28 +1,26 @@
import glob
import yaml
import copy
Also here - did this import end up unneeded?
Agree, I'm not seeing any usage of the copy library here.
Oops, missed this. Will remove.
Thanks for opening this, Marshall!
Yeah I'm good with breaking changes on --subset. This is so experimental that it's still not documented, other than comments in #746 ;-) And while the --subset feature is great already, I agree it can still be improved substantially.
I love how this sets the stage to automate the maintenance of many custom templates at once.
I'd like to question part of the philosophy of the approach, however. If I understand correctly, you're using the subset file to add attributes to the fields? I'd like to avoid doing that, as I think this will be confusing. I'd rather keep "subset" as strictly a filtering mechanism to determine which fields are included and excluded in the output.
Since the #864 rewrite, the --include mechanism no longer validates custom YAML files prior to merging. Original ECS fields + the included custom fields are all read & merged, prior to doing any validation. I redesigned it this way to allow modifying existing ECS fields without having to duplicate mandatory field attributes like before. In other words, if you want to add exceptionable: true to an existing ECS field, you should be able to do this with a very minimal custom file, e.g.:
- name: event
  fields:
    - name: action
      exceptionable: true
In case the current implementation breaks on unknown attributes, it's fine to adapt it so this works.
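A rough sketch of that merge-then-validate idea (not the actual loader code; the deep_merge helper and the pre-existing field dict are assumptions for illustration):

import yaml

# Sketch only: a minimal custom file carries just the attribute to add, and is
# deep-merged onto the existing ECS definition before any validation runs.
custom = yaml.safe_load("""
- name: event
  fields:
    - name: action
      exceptionable: true
""")

def deep_merge(base, override):
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        else:
            base[key] = value

# Hypothetical pre-existing ECS definition for event.action (attributes abridged).
ecs_event_action = {'name': 'action', 'type': 'keyword', 'level': 'core'}
deep_merge(ecs_event_action, custom[0]['fields'][0])
print(ecs_event_action)
# -> {'name': 'action', 'type': 'keyword', 'level': 'core', 'exceptionable': True}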
I'd like to react to 2 comments I've seen here.
The complete intermediate file will not contain any custom fields
By this do you mean "any custom attributes"? (e.g. like exceptionable). I'd be open to the intermediate files containing custom attributes, as a matter of fact. This will not affect the "official" ECS files, so I don't think this introduces a burden on ECS. If this flexibility is useful when generating custom artifacts, I'm all for it.
Filters all host fields except for host.name in the generated artifacts
@ebeahan I'm not 100% sure what exceptionable is, but I don't think it's about this process of filtering which fields are output in the artifacts. My understanding is that exceptionable is an actual field attribute that ultimately makes sense in the final Beats field definition YAML file.
A few additional thoughts:
- Instead of fieldname: {} for leaf fields, how about we support this? fields: [field1, field2]. That would be much less wordy.
- Subset currently acts as an allow list. But within an allowed grouping of fields, I'd like to be able to remove sections as well.
  - Example: I want all of the log.* fields except log.syslog.*
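One possible shape for that shorthand (illustrative only; nothing in this PR implements it): accept a plain list of leaf names and expand it to the dict form the filter already understands.

# Hypothetical helper for the suggested "fields: [field1, field2]" shorthand.
def expand_field_list(fields):
    if isinstance(fields, list):
        # fields: [field1, field2]  ->  {'field1': {}, 'field2': {}}
        return {name: {} for name in fields}
    return fields

print(expand_field_list(['pid', 'name']))  # -> {'pid': {}, 'name': {}}
print(expand_field_list({'pid': {}}))      # dict form passes through unchanged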
scripts/generator.py
Outdated
for subset in subsets:
    subfields = subset_filter.extract_matching_fields(fields, subset['fields'])
    intermediate_files.generate(subfields, os.path.join(out_dir, 'ecs', 'subset', subset['name']), default_dirs)

merged_subset = subset_filter.combine_all_subsets(subsets)
if merged_subset:
    fields = subset_filter.extract_matching_fields(fields, merged_subset)
Please keep all subset-related functionality like this loop inside subset_filter.py.
I'll refactor this
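A sketch of what that refactor could look like, assuming it lives in scripts/schema/subset_filter.py next to extract_matching_fields and combine_all_subsets (both visible in the diff above); the function name and signature are hypothetical:

import os

# Hypothetical subset_filter.py entry point; the intermediate-file generator is
# passed in as a callable so generator.py only needs a single call into this module.
def generate_and_merge_subsets(fields, subsets, out_dir, generate_intermediate, default_dirs):
    for subset in subsets:
        subfields = extract_matching_fields(fields, subset['fields'])
        generate_intermediate(
            subfields, os.path.join(out_dir, 'ecs', 'subset', subset['name']), default_dirs)
    merged_subset = combine_all_subsets(subsets)
    if merged_subset:
        fields = extract_matching_fields(fields, merged_subset)
    return fields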
scripts/generator.py
Outdated
es_template.generate(flat, ecs_version, out_dir, args.template_settings, args.mapping_settings)
beats.generate(nested, ecs_version, out_dir)
if args.include or args.subset:
    exit()

csv_generator.generate(flat, ecs_version, out_dir)
Please keep the CSV before the customization's early exit.
I'd much rather have a flag that lets the user pick which artifact they want generated (could be done here, or as another PR).
Examples:
- Current behaviour when there's customizations: --artifacts csv,beats,intermediate
- What the Endpoint team wants: --artifacts beats,intermediate
WDYT?
Note: if you look into adding this flag to the PR, you can keep things simple and continue ignoring asciidoc for now. In customization cases, I don't want to have to support generating customized docs.
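A sketch of what such a flag could look like (tracked separately and not implemented in this PR); the flag name and artifact names are illustrative, and argparse is assumed since the script already exposes flags like --subset and --include:

import argparse

# Sketch only: let the user pick which artifacts to generate.
parser = argparse.ArgumentParser()
parser.add_argument('--artifacts', default='csv,beats,intermediate,es-template',
                    help='comma-separated list of artifacts to generate')
args = parser.parse_args(['--artifacts', 'beats,intermediate'])

selected = set(args.artifacts.split(','))
if 'beats' in selected:
    print('would call beats.generate(...)')
if 'csv' in selected:
    print('would call csv_generator.generate(...)')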
I moved this because certain subsets can break the csv generation and cause the tooling to crash (if the subset sets enabled: false on an object that isn't explicitly listed, like process.thread, then that field is no longer only an intermediate field but it doesn't have all the required fields for csv generation). For now I can add the necessary attributes for csv generation to a field when a subset changes the field from intermediate: true to intermediate: false, and then move the csv generation back to where it was.
We don't have an immediate need for a flag to restrict the artifacts that are generated, our Makefile removes the ones we don't need.
I like the concept of an --artifacts flag and think it'd be a good addition to incorporate eventually. Opened #885 to capture it.
@@ -195,6 +195,8 @@ def merge_fields(a, b):
        asd['reusable']['top_level'] = bsd['reusable']['top_level']
    else:
        asd['reusable'].setdefault('top_level', True)
    if 'order' in bsd['reusable']:
        asd['reusable']['order'] = bsd['reusable']['order']
Endpoint customizations have chained reuses like group => user => other places?
In any case 👍
Yeah, we reuse process at Target.process, and since hash is reused in process we need a way to either set the order for hash to 1 or the order for process to 3.
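For illustration, a custom schema could then set the attribute the diff above merges; everything here besides the order key under reusable (names, values, the rest of a real schema entry) is just an assumed example:

import yaml

# Sketch only: 'order' under 'reusable' is the attribute merged in the diff above;
# a real custom schema entry would also carry expected reuse locations, etc.
custom_schema = yaml.safe_load("""
- name: hash
  reusable:
    order: 1
- name: process
  reusable:
    order: 3
""")
print({entry['name']: entry['reusable']['order'] for entry in custom_schema})
# -> {'hash': 1, 'process': 3}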
Thanks Mat! I initially tried adding the custom attributes to the schema files themselves, but it's useful to have different values per subset for a custom attribute on a particular field. For example, our event types coming from the endpoint include ... In this approach the schema files define the full repository of available fields and their types, which ensures that all documents that use the schemas use consistent field names and types. The subset files then pick specific fields from the schema and apply attributes that don't need to be consistent between all documents that use the schemas.
Yeah, I meant custom attributes there. I avoided including custom attributes in the final ECS intermediate files because the "official" files are generated after merging subsets together, and if 2 subsets specified different values for the same custom attribute we wouldn't know how to merge the attributes. Additionally, in our use case we've found it useful to have the intermediate files for each individual subset even before custom attributes were introduced - so automatically generating all of the intermediate files for each subset and dropping the custom attributes in satisfies both needs simultaneously.
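To illustrate the merge ambiguity (the subset names and field choice here are made up): two subsets can legitimately give the same field different values for a custom attribute, so no single value could represent both in the merged, official intermediate files.

# Sketch only: conflicting per-subset values for the same custom attribute.
subset_a = {'name': 'alerts',
            'fields': {'process': {'fields': {'name': {'exceptionable': True}}}}}
subset_b = {'name': 'metrics',
            'fields': {'process': {'fields': {'name': {'exceptionable': False}}}}}

for subset in (subset_a, subset_b):
    leaf = subset['fields']['process']['fields']['name']
    print(subset['name'], leaf['exceptionable'])
# alerts True
# metrics False   -> the merged output has no obvious value to keep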
Yeah, improvements like "and include all other subfields with default options" would fit in with your last 2 bullet points regarding allowing ...
Yes - thanks @marshallmain for clarifying 😄
I agree that limiting ... I think we move forward with introducing the functionality under ...
Also agree this would be a more precise syntax. Not a blocker here but a good future enhancement.
I don't have the ability to merge anymore - @ebeahan if this LGTM can you hit the button?
@marshallmain one item I overlooked earlier - can you add an entry to ...? Otherwise LGTM, and we'll get this merged.
The main purpose of this PR is to produce intermediate generated files where the fields can be extended with custom options in subset files. As a user, I can then write a script to read the intermediate files and generate other necessary files from them. To achieve this, some (small) breaking changes to the subset format are needed.
This PR makes some changes and improvements to the subset format. While there are breaking changes, updating subsets to match the changes is easy. It would probably also be good to mark the subset feature as experimental.
- fields keys are no longer treated as equivalent to fields: '*'. Instead, fields is required as a key for any field that has sub-fields in the schema (object or nested), and fields is required NOT to be a key for fields that don't have sub-fields. This makes it easier to support custom options on individual fields since we don't have cases where fields may or may not be an option. IMO this is also a more intuitive format since leaf fields will no longer have the awkward fields: '*' option and objects will clearly show if they are including all subfields.
- name and fields are added as top level fields in subsets, rather than having the top level schema fields be the top level keys. The name is used as the directory to write individual intermediate generated files to. fields is added to keep the fields (what used to be the entire subset file) separate from the name.
- When the --subset option is used, each subset will produce intermediate generated files individually before the subsets are merged to produce the complete intermediate files. When creating the individual versions, any custom options on subset fields are added to the schema and therefore show up in the generated files. This is useful for annotating fields that have certain properties outside of elasticsearch - for example, we want to keep track of which fields are valid for users to key on when creating rule exceptions, so we'll add a custom option exceptionable: true to those fields in the subset, and that will show up in the generated file as well.

Example field, annotated with exceptionable: true:

Subset that would create this field:
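The original example blocks aren't reproduced above; as an illustrative stand-in (based on the host_event example earlier in this thread), here is a subset in the new format and a rough sketch of the kind of field entry it would yield, with attribute names other than exceptionable abridged and only indicative:

import yaml

# Illustrative only: subset in the new format (name + fields at the top level,
# 'fields' used only for fields that have sub-fields in the schema).
subset = yaml.safe_load("""
name: host_event
fields:
  host:
    fields:
      name:
        exceptionable: true
""")

# Rough sketch of the resulting per-subset intermediate field entry; the real
# generated files carry the full set of schema attributes, not just these.
generated_host_name = {
    'name': 'name',
    'flat_name': 'host.name',
    'type': 'keyword',
    'exceptionable': True,
}
print(subset['name'], generated_host_name['exceptionable'])  # -> host_event True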