Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shorthand subset notation #805

Merged
merged 7 commits into from
Apr 8, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.next.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ Thanks, you're awesome :-) -->

* Add support for reusing offical fieldsets in custom schemas. #751
* Add full path names to reused fieldsets in `nestings` array in ecs_nested.yml. #803
* Allow shorthand notation for including all subfields in subsets. #805

#### Deprecated

Expand Down
2 changes: 2 additions & 0 deletions scripts/generator.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,8 @@ def main():
with open(file) as f:
raw = yaml.safe_load(f.read())
ecs_helpers.recursive_merge_subset_dicts(subset, raw)
if not subset:
raise ValueError('Subset option specified but no subsets found')
intermediate_fields = ecs_helpers.fields_subset(subset, intermediate_fields)

(nested, flat) = schema_reader.generate_nested_flat(intermediate_fields)
Expand Down
16 changes: 10 additions & 6 deletions scripts/generators/ecs_helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,25 +52,29 @@ def safe_merge_dicts(a, b):

def fields_subset(subset, fields):
retained_fields = {}
allowed_options = ['fields']
for key, val in subset.items():
# Every field must have a 'fields' key or the schema is invalid
if isinstance(val['fields'], dict):
for option in val:
if option not in allowed_options:
raise ValueError('Unsupported option found in subset: {}'.format(option))
# A missing fields key is shorthand for including all subfields
if 'fields' not in val or val['fields'] == '*':
Copy link
Contributor

@webmat webmat Mar 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I'd make the check a little more specific, to fail more explicitly if someone makes a typo.

If it's specifically an empty dictionary (rather than the absence of the key "fields"), then it's the shorthand to grab all subkeys.

Otherwise the following silently gets me all of host instead of just host.name:

base:
  fields: "*"
host:
  feels:
    name:
      fields: "*"

I think realistically, users will use this with many field sets (10-15?), so a subtle typo like that could go unnoticed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, how about a list of allowed keys to safeguard against that case instead? I haven't gotten around to it yet but I'd like to allow index as an option for each field and a list of allowed keys would let you do something like

host:
  index: false

to grab all the host fields with the shorthand notation while also excluding the potential mistyped fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something more strict like a whitelist of allowed keys in that "subset" structure sounds good, yeah.

I'm not sure I follow what index does here, though. Can you clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't know how I missed this, I thought I replied.

index here would correspond to the index property in the elasticsearch mapping. So this would let us specify fields from the schema that we want to exist in documents but don't need to be indexed. This particular use case may or may not be useful but it's something to consider.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When saving a document in Elasticsearch, the raw document always remain complete and untouched.

The behaviour of index:false on field myfield simply does not create a searchable mapping entry for myfield. But if you were to read a document that contains myfield, it would be present in the _source of the doc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misread your last comment. Sorry for explaining essentially what you just said 😂

Let's keep the whitelist to only accept "fields" for now, if you don't mind.

retained_fields[key] = fields[key]
elif isinstance(val['fields'], dict):
# Copy the full field over so we get all the options, then replace the 'fields' with the right subset
retained_fields[key] = fields[key]
retained_fields[key]['fields'] = fields_subset(val['fields'], fields[key]['fields'])
elif val['fields'] == '*':
retained_fields[key] = fields[key]
return retained_fields


def recursive_merge_subset_dicts(a, b):
for key in b:
if key not in a:
a[key] = b[key]
elif 'fields' not in a[key] or 'fields' not in b[key] or b[key]['fields'] == '*':
a[key]['fields'] = '*'
elif isinstance(a[key]['fields'], dict) and isinstance(b[key]['fields'], dict):
recursive_merge_subset_dicts(a[key]['fields'], b[key]['fields'])
elif b[key]['fields'] == "*":
a[key]['fields'] = b[key]['fields']


def yaml_ordereddict(dumper, data):
Expand Down
4 changes: 1 addition & 3 deletions scripts/tests/test_ecs_helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -100,9 +100,7 @@ def test_recursive_subset_merge(self):
}
}
},
'field2': {
'fields': '*'
}
'field2': {}
}
subset_b = {
'field1': {
Expand Down