Add dump functionality to Entity Resolvers and NativeQuestionAnswerer #341
Conversation
… for dumping and loading; a few bug fixes and some log-info modifications; separated the Factory class from the QuestionAnswerer class
A few comments. I am still a bit unclear on why people want to dump QA models when they can just call load_kb. Is it because load_kb takes a long time?
Based on our offline discussion regarding this PR and some analysis of the runtimes, we decided to add the following changes to the QA module as part of this PR, to bring native QA usability on par with Elasticsearch:
…r adding dump functionality in resolvers
…ss and explicitly mention arguments list in BaseQuestionAnswerer
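For context, the target usability is roughly the following sketch: `load_kb` on the first run, then construct a question answerer on later runs without re-fitting. This is a hypothetical flow, not the exact API; argument names such as `app_path` here are assumptions, and the constructor may differ after this PR's factory refactor.

```python
from mindmeld.components.question_answerer import QuestionAnswerer

# First run: ingest the KB; this fits the underlying resolvers and, with this
# PR, also dumps their state to disk. (Argument names are illustrative.)
QuestionAnswerer.load_kb(
    app_namespace="my_app", index_name="stores",
    data_file="data/stores.json", app_path="my_app",
)

# Subsequent runs (same data and configs): no load_kb needed; the native QA
# can restore the dumped resolver state instead of re-fitting from scratch.
qa = QuestionAnswerer(app_path="my_app")
results = qa.get(index="stores", store_name="Central Plaza")
```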
mindmeld/components/nlp.py (Outdated)
```python
    incremental_model_path if incremental_timestamp else model_path
)
self.entity_resolver.load()
self.entity_resolver.fit(clean=False)
```
Just wondering, do we need this for backward-compatibility reasons? In general, I think we would prefer not to call `fit()` when loading the models, so that behavior across the NLP pipeline is more consistent and predictable (users won't be surprised by model building happening during `load()`). The changes in this PR help us move in that direction, since we no longer need to call `fit()` in the resolver's implementation of `load()`, and we should keep it that way if we can?
Yeah, I absolutely agree with your point, Marvin. I did it this way with WxA in mind: I realized that if we dump and load resolver models (exact match in the WxA case), we will use more disk space, since we would need to dump the entity map (i.e. the processed KB data) in the `dump()` call and load it back in the `load()` call of the NLP pipeline. So I would like to confirm whether the extra disk space is a problem for WxA before making the change you suggested. @vijay120 @vembar @mhuang2173
Update: It is more logical not to dump a copy of the KB data when we call `dump()` for a resolver. In the latest push, I added an `entity_map` argument to the `load()` method to address this. So what happens now is that the KB data is loaded in the `load()` method as well, similar to the `fit()` method, and hence I can replace `self.entity_resolver.fit(clean=False)` with `self.entity_resolver.load()`. Also, in the case of question answerers, we can pass the KB data already looked up by the question answerers directly to the `load()` method.
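A minimal sketch of what that load-side change could look like; the class and helper names below are hypothetical, not the actual diff:

```python
class EntityResolverBase:
    """Hypothetical sketch of the load()/entity_map interaction described above."""

    def load(self, entity_map=None):
        # The KB data is loaded here as well, similar to fit(); callers such
        # as the question answerer can pass already-looked-up KB data directly
        # instead of the resolver re-reading a dumped copy from disk.
        if entity_map is None:
            entity_map = self._get_entity_map()  # hypothetical: re-read mapping files
        self.entity_map = entity_map
        self._load()  # hypothetical subclass hook: restore the fitted model state
```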
""" | ||
|
||
# dump underlying resolver model/algorithm/embeddings | ||
self._dump(path) |
Optional: Maybe we should add a message to the user consistent with other models - "Saving entity resolver: domain=<>, intent=<>, entity=<>"
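For example, something along these lines; the attribute names (`domain`, `intent`, `entity_type`) are assumptions about the resolver's metadata, not confirmed by the diff:

```python
import logging

logger = logging.getLogger(__name__)


class EntityResolver:  # sketch: only the suggested logging addition shown
    def dump(self, path):
        # Mirror the save message emitted by other NLP models.
        logger.info(
            "Saving entity resolver: domain=%r, intent=%r, entity=%r",
            self.domain, self.intent, self.entity_type,
        )
        self._dump(path)  # existing hook that persists the underlying model
```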
Looking good, a few more comments.
mindmeld/components/nlp.py (Outdated)
```python
    incremental_model_path if incremental_timestamp else model_path
)
self.entity_resolver.load()
except ElasticsearchConnectionError:
```
I think we might need to replace this one with a more general exception class?
I added a more generic exception (`EntityResolverError`) in the entity_resolver.py module and removed the exception handling from here. Is there a specific reason why we decided to pass on this `ElasticsearchConnectionError` here without raising an error? (It has been part of our code base for a long time.) @mhuang2173 Also, in the case of resolvers without any training data, this error will not be raised, due to some checks that we already have in entity_resolver.py.
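A sketch of the resulting handling in nlp.py, assuming `EntityResolverError` is importable from the entity_resolver module as described above; whether to raise or silently pass reflects the open question here, not a settled decision:

```python
from mindmeld.components.entity_resolver import EntityResolverError  # added in this PR

try:
    self.entity_resolver.load()
except EntityResolverError as exc:
    # Unlike the old ElasticsearchConnectionError branch, which silently
    # passed, surface the failure (resolvers without training data never
    # raise here due to checks inside entity_resolver.py).
    logger.error("Unable to load entity resolver: %s", exc)  # logger assumed in scope
    raise
```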
mindmeld/models/embedder_models.py (Outdated)
```python
    t["entity"] for t in
    self.text_preparation_pipeline.tokenize_and_normalize(text)
]

def dump(self):
```
Maybe we need a `_dump()` for embedder-model-specific logic? Also, there seems to be some overlap between the generic embedding cache and the GloVe-specific cache?
So the embedder models' dump functionality just dumps and loads a cache object; no configs are dumped. Because all embedder classes use the cache in the same way, I guess we don't need an embedder-specific dump? @mhuang2173
As for the overlapping dump logic for GloVe, I would like to take that up in PR 325 along with some more modifications to the GloVe embedder class.
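A simplified sketch of that shared-cache idea; the class, method names, and file layout here are assumptions, not the PR's exact code:

```python
import os
import pickle


class EmbedderCacheMixin:
    """Generic dump/load of a text -> embedding cache shared by all embedders."""

    def __init__(self, cache_path):
        self.cache_path = cache_path  # derived from app_path in the PR
        self.cache = {}               # maps text -> embedding vector

    def dump_cache(self):
        # One generic dump suffices, since every embedder uses the cache alike.
        cache_dir = os.path.dirname(self.cache_path)
        if cache_dir:
            os.makedirs(cache_dir, exist_ok=True)
        with open(self.cache_path, "wb") as fp:
            pickle.dump(self.cache, fp)

    def load_cache(self):
        # Restore the cache if a previous run dumped it.
        if os.path.exists(self.cache_path):
            with open(self.cache_path, "rb") as fp:
                self.cache = pickle.load(fp)
```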
```python
_resource_loader = NativeQuestionAnswerer.get_resource_loader()
for field_name, field_resource in index_resources.items():
    field_resource.update_resource(
        id2value={},
```
Why an empty dict here?
(Related comment: #341 (comment))
So by passing `id2value` as an empty dict, we could trigger the loading of resolvers. But because that isn't good code style, I have now modified the code to have two separate methods, `update_resource` and `load_resource`, the latter taking care of only loading resolvers for FieldResources. So the line 877 that you pointed at no longer exists.
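A sketch of the resulting split, simplified; the real `FieldResource` methods in question_answerer.py take more arguments, and the two helpers below are hypothetical:

```python
class FieldResource:
    """Simplified sketch of the update/load split described above."""

    def __init__(self):
        self.id2value = {}

    def update_resource(self, id2value):
        # Ingest KB documents for this field and (re)fit its resolver.
        self.id2value.update(id2value)
        self._fit_resolver()   # hypothetical helper

    def load_resource(self):
        # Only restore an already-fitted resolver from disk; no KB data needed,
        # so callers no longer pass id2value={} just to trigger loading.
        self._load_resolver()  # hypothetical helper
```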
```python
if os.path.exists(resolver_config_path):
    with open(resolver_config_path, "rb") as fp:
        self.resolver_configurations = pickle.load(fp)
        fp.close()
```
Update:
This PR adds new functionality to entity resolvers (and thereby the question answerers built on top of them) to dump their state and load it back from the dumped state. In addition to this end use, a few disk-space optimizations are added so that the KB data is not cached by both the QA and the underlying resolvers. Lastly, minor bugs in the entity resolvers were fixed, and more comments/type hints were added.
Once approved, details will be added to the Mindmeld docs so users can take advantage of the resolver dump and load methods.
Problem:
Before this PR, anyone wishing to use a NativeQuestionAnswerer (i.e. the non-Elasticsearch QA) always had to call the `.load_kb()` method and fit the underlying resolver models before using the QA for search/inference. This is unlike ElasticsearchQA, which needs load_kb called only once, since ElasticsearchQA loads back its indices when the `.get()` and `.build_search()` methods are called directly. In cases where neither the data nor the input configurations have changed from a previous run, the NativeQuestionAnswerer of course need not re-fit the resolvers, provided there is a way to dump the resolvers' states from the previous run.
Solution:
- Added dump and load capability to entity resolvers (related path changes in the `path.py` file), including the ability for an embedder model in embedder-based resolvers to dump its embeddings cache to a path derived from `app_path` (related changes also in the `embedder_models.py` file).
- Changes in the `nlp.py` file: previously, the load methods of resolvers simply redirected to the fit method, so the load-method calls to the entity resolver have been replaced with fit-method calls in `nlp.py`.
- Added dump and load functionality in the `question_answerer.py` file to encapsulate the above and resolve the original problem of this PR (see the sketch below).
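Taken together, the intended resolver lifecycle is roughly as follows. This is a placeholder sketch of the lifecycle only, not MindMeld's actual classes or method signatures; exact usage will be documented once the PR is approved.

```python
class Resolver:
    """Placeholder illustrating the new dump/load lifecycle described above."""

    def fit(self) -> None:
        """First run: build the model from the KB / mapping files."""

    def dump(self) -> None:
        """First run: persist the fitted state (and embedder cache) under app_path."""

    def load(self) -> None:
        """Later runs: restore the fitted state without re-fitting."""


resolver = Resolver()
resolver.fit()
resolver.dump()
# ...process restarts with unchanged data and configs...
resolver.load()
```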
Discussions:
Backwards compatibility of the NLP pipeline: instead of `self.entity_resolver.load()`, we call `self.entity_resolver.fit(clean=False)` in the load method of `EntityProcessor`.