
Add dump functionality to Entity Resolvers and NativeQuestionAnswerer #341

Merged
merged 28 commits into cisco:master
Oct 29, 2021

Conversation

murali1996
Contributor

@murali1996 murali1996 commented Jul 19, 2021

Update:

  • Based on some discussion, calling .fit() in the nlp load process seemed unintuitive, so the necessary modifications were made in entity_resolvers.py and nlp.py to address this. See the related comments here.
  • Pending discussion: should the KB be cached by NativeQA, or should the data be loaded every time the QA is instantiated?
  • Using .dump() and .load() in the NLP pipeline leads to the creation of new files (.config.pkl and .pkl.hash) for every resolver created in WxA. Pending discussion on this as well.


This PR adds new functionality to entity resolvers (and thereby the question answerers built on top of them) to dump their state and load back from the dumped state. In addition, a few disk-space optimizations ensure that the KB data is not cached by both the QA and the underlying resolvers. Lastly, minor bugs in the entity resolvers were fixed, and more comments and type hints were added.

Once approved, details will be added to the Mindmeld docs so users can take advantage of the resolvers' dump and load methods.

Problem:

Before this PR, anyone wishing to use a NativeQuestionAnswerer (i.e. the non-Elasticsearch QA) always had to call the .load_kb() method and fit the underlying resolver models before using the QA for search/inference. This is unlike ElasticsearchQA, which needs load_kb called only once, because ElasticsearchQA loads back its indices when the .get() and .build_search() methods are called directly. In cases where neither the data nor the input configurations have changed from a previous run, the NativeQuestionAnswerer of course need not re-fit the resolvers, provided there is a way to dump the resolvers' states in that previous run. A minimal usage sketch follows.
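
For context, here is a minimal sketch of the usage pattern in question (app path, index name, and data file are made up; exact constructor and signatures may differ between Mindmeld versions):

```python
from mindmeld.components import QuestionAnswerer

qa = QuestionAnswerer(app_path="my_app")

# Native QA before this PR: this had to run in every new process before
# any query, re-fitting the underlying resolvers each time.
qa.load_kb("my_app", "stores", "my_app/data/stores.json")

# Elasticsearch QA, by contrast, needs load_kb only once; later calls to
# .get() / .build_search() load back the already-built indices.
results = qa.get(index="stores", store_name="Pine and Market")
```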

Solution:

  • In this PR, dump and load functionality are first added to the entity resolvers. This required a few changes to the resolver paths (path.py) and the ability for an embedder model in embedder-based resolvers to dump its embeddings cache to a path derived from app_path (related changes in embedder_models.py).
  • In order to keep the NLP pipeline unaltered, a minor change has been made to nlp.py. Previously, the load methods of the resolvers simply redirected to the fit method, so the load() calls to the entity resolver in nlp.py have been replaced with fit() calls.
  • Note that the NLP pipeline doesn't use the dump functionality of the resolvers, as it would create more data files on disk. This is implemented keeping both WxA and end users in mind.
  • Because the QA uses the resolvers' dump functionality, some changes were made in question_answerer.py to encapsulate that and resolve the original problem of this PR. A toy illustration of the resulting fit/dump/load lifecycle follows this list.
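
To make that lifecycle concrete, here is a small self-contained toy of the fit/dump/load pattern; this is not the Mindmeld implementation, just its shape, with sidecar files mirroring the .config.pkl and .pkl.hash files mentioned in the update above:

```python
import hashlib
import pickle
from pathlib import Path

class ToyResolver:
    """Toy stand-in for an entity resolver with dump/load support."""

    def __init__(self, entity_map, config):
        self.entity_map = entity_map  # processed KB data
        self.config = config
        self.model = None

    def fit(self):
        # "Training" here is just indexing synonyms -> canonical names.
        self.model = {syn.lower(): cname
                      for cname, syns in self.entity_map.items()
                      for syn in syns}

    def dump(self, path):
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(self.model))
        # Sidecar files analogous to .config.pkl and .pkl.hash:
        Path(f"{path}.config.pkl").write_bytes(pickle.dumps(self.config))
        digest = hashlib.sha1(pickle.dumps(self.entity_map)).hexdigest()
        Path(f"{path}.hash").write_text(digest)

    def load(self, path):
        # Restore state without re-fitting.
        self.model = pickle.loads(Path(path).read_bytes())

resolver = ToyResolver({"Pine & Market": ["pine and market"]}, {"type": "exact"})
resolver.fit()
resolver.dump("/tmp/resolvers/store_name.pkl")
resolver.load("/tmp/resolvers/store_name.pkl")
```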

Discussions:

Backwards compatibility of the NLP pipeline

  • The NLP pipeline is mostly left unaltered by the introduction of the resolvers' dump and load methods. The only change is that instead of calling self.entity_resolver.load(), we call self.entity_resolver.fit(clean=False) in the load method of EntityProcessor.
  • No backward inconsistencies arise for the ExactMatch, TFIDF, and EmbedderCosSim/SentenceBert resolvers due to this change, as their .load methods previously just called .fit anyway.
  • However, for the Elasticsearch resolver, and only when the data or resolver configurations have changed, there is a backward inconsistency (for good reasons!). Previously, when the resolver's load method was called, it checked whether the index existed and fit the resolver with the latest data and configurations only if it did not. Now the resolver is fit with the latest data and configurations regardless of the index's existence. This slightly increases the loading time of the resolver, since synonyms are re-ingested whenever ElasticsearchResolver.load() is called; see the sketch after this list.
  • The only corner case is when a user creates a resolver index with specific configurations and then runs the nlp pipeline with a different resolver configuration, which is very unlikely. And when the user changes the data, it is always better to ingest the newer data into the index.
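
Schematically, the Elasticsearch resolver's load behavior changed roughly like this (toy code, not the actual implementation):

```python
class ElasticsearchResolverSketch:
    def load_before_this_pr(self):
        # Fit only when the index was missing; an existing index was
        # reused even if data or configs had changed since it was built.
        if not self.index_exists():
            self.fit()

    def load_after_this_pr(self):
        # Always re-fit, so the index always reflects the latest data and
        # configs, at the cost of re-ingesting synonyms on every load().
        self.fit()
```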

@murali1996 murali1996 requested review from vembar and vijay120 July 19, 2021 19:08
vembar
vembar previously approved these changes Jul 20, 2021
@murali1996 murali1996 marked this pull request as draft July 26, 2021 18:02
@murali1996 murali1996 self-assigned this Aug 4, 2021
… for dumping and loading; few bug fixes and some log info modifications; separated Factory class from QuestionAnswerer class
@murali1996 murali1996 changed the title from "Address Issue #340 (TypeError: can't pickle sqlite3.Connection objects)" to "Feature/Dumping feature for QuestionAnswerers" Aug 5, 2021
@murali1996 murali1996 requested a review from vembar August 5, 2021 10:27
Contributor

@vijay120 vijay120 left a comment


A few comments. I am still a bit unclear on why people are trying to dump QA models when they can just load_kb it. Is it because load_kb takes a long time?

@murali1996 murali1996 changed the title from "Feature/Dumping feature for QuestionAnswerers" to "Few fixes related to QuestionAnswerers" Aug 21, 2021
@murali1996
Copy link
Contributor Author

A few comments. I am still a bit unclear on why people are trying to dump QA models when they can just load_kb it. Is it because load_kb takes a long time?

Based on our offline discussion regarding this PR and some analysis of the runtimes, we decided to add the following changes to the QA module as part of this PR, to bring native QA usability on par with Elasticsearch:

  • Dump models built as part of native QA's .load_kb() method to the path ~/.cache/mindmeld, so that users don't have to call .load_kb() in every environment before querying indices.
  • Implement a new method _load_field_info() as part of native QA, mirroring its functionality in Elasticsearch QA, to load the previously dumped models when .get() and .build_search() are called directly, without a prior .load_kb() call in the user's environment.
  • Add notes in the docs informing users that, to replicate the state of the QA module when using native QA (especially in a deployment environment), they'll have to copy the model dump files in ~/.cache/mindmeld. The intended flow is sketched below.
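
Continuing the earlier sketch, the intended build-vs-deployment split looks roughly like this (cache location per the notes above; everything else schematic):

```python
# Build environment: load_kb() fits the resolvers and dumps them
# under ~/.cache/mindmeld.
qa.load_kb("my_app", "stores", "my_app/data/stores.json")

# Deployment environment: after copying ~/.cache/mindmeld across,
# .get() and .build_search() restore the dumped models internally via
# _load_field_info(), so no load_kb() call is needed here.
results = qa.get(index="stores", store_name="Pine and Market")
```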

incremental_model_path if incremental_timestamp else model_path
)
self.entity_resolver.load()
self.entity_resolver.fit(clean=False)
Contributor


Just wondering, do we need this for backward-compatibility reasons? In general we would prefer not to do fit() when loading the models, so that behavior across the NLP pipeline is more consistent and predictable (users won't be surprised by any kind of model building happening during load()). I think the changes in this PR help us move in that direction, since we don't need to do fit() in the resolvers' implementation of load() anymore, and we should keep it that way if we can?

Contributor Author


Yeah, I absolutely agree with your point, Marvin. I did this with WxA in mind; I realized that if we dump and load resolver models (exact match in WxA's case), we will use more disk space, since we'd need to dump the entity map (i.e. the processed KB data) in the dump() call and load it back in the load() call of the NLP pipeline. So I would like to confirm whether the extra disk space is a problem in the case of WxA before making the change you suggested. @vijay120 @vembar @mhuang2173

Contributor Author

@murali1996 murali1996 Sep 29, 2021


Update: It is more logical not to dump a copy of the KB data when we call dump() for a resolver. In the latest push, I added an entity_map argument to the load() method to address this. Now the KB data is loaded in the load() method, similar to the fit() method, so I can replace self.entity_resolver.fit(clean=False) with self.entity_resolver.load(). Also, in the case of question answerers, we can pass the KB data already looked up by the question answerers directly to the load() method.
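
Roughly the new shape (schematic, not the exact signature; _load_entity_map is a placeholder name):

```python
def load(self, path, entity_map=None):
    # A question answerer that already holds the KB data can pass it in;
    # otherwise the resolver loads it itself, just as fit() does, so
    # dump() never needs to persist a copy of the KB.
    self.entity_map = entity_map if entity_map is not None else self._load_entity_map()
    self._load(path)
```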

"""

# dump underlying resolver model/algorithm/embeddings
self._dump(path)
Contributor


Optional: Maybe we should add a message to the user consistent with other models - "Saving entity resolver: domain=<>, intent=<>, entity=<>"
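
For instance, something like the following sketch (the domain/intent/type attribute names are assumptions about the resolver class):

```python
import logging

logger = logging.getLogger(__name__)

def dump(self, path):
    # Mirror the save messages of the other models:
    logger.info("Saving entity resolver: domain=%s, intent=%s, entity=%s",
                self.domain, self.intent, self.type)
    self._dump(path)
```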

Contributor

@mhuang2173 mhuang2173 left a comment


looking good, a few more comments

incremental_model_path if incremental_timestamp else model_path
)
self.entity_resolver.load()
except ElasticsearchConnectionError:
Contributor


I think we might need to replace this one with a more general exception class?

Contributor Author

@murali1996 murali1996 Oct 12, 2021


I added a more generic exception (except EntityResolverError) in the entity_resolver.py module and removed the exception from here. Is there a specific reason we decided to pass on this ElasticsearchConnectionError here without raising an error? (It has been part of our code base for a long time.) @mhuang2173. Also, in the case of resolvers without any training data, this error will not be raised, due to some checks we already have in entity_resolver.py.
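
Schematically, the handling now looks something like this (import path and surrounding code are assumptions; the actual handling lives in entity_resolver.py per the comment above):

```python
from mindmeld.components.entity_resolver import EntityResolverError

try:
    processor.entity_resolver.load()
except EntityResolverError as exc:
    # A single resolver-level error now covers Elasticsearch connection
    # failures as well, rather than silently passing on
    # ElasticsearchConnectionError.
    logger.error("Failed to load entity resolver: %s", exc)
    raise
```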

t["entity"] for t in
self.text_preparation_pipeline.tokenize_and_normalize(text)
]

def dump(self):
Contributor


Maybe we need a _dump() for embedder-model-specific logic?

Also there seems to be some overlap between the generic embedding cache and the Glove-specific cache?

Contributor Author


So the embedder models' dump functionality just dumps and loads a cache object; no configs are dumped. Because all embedder classes hold the cache in the same way, I guess we don't need an embedder-specific dump? @mhuang2173 (A toy illustration follows.)

And for the overlapping Glove dump, I would like to take that up in PR 325 with some more modifications to the Glove embedder class.
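
A toy illustration of why a single generic dump suffices: every embedder persists the same kind of text-to-vector cache (this is not the Mindmeld implementation):

```python
import pickle

class EmbedderCache:
    """Generic text -> embedding-vector cache shared by all embedders."""

    def __init__(self, path):
        self.path = path
        self.cache = {}

    def dump(self):
        # Only the cache object is persisted; no configs.
        with open(self.path, "wb") as fp:
            pickle.dump(self.cache, fp)

    def load(self):
        with open(self.path, "rb") as fp:
            self.cache = pickle.load(fp)
```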

_resource_loader = NativeQuestionAnswerer.get_resource_loader()
for field_name, field_resource in index_resources.items():
field_resource.update_resource(
id2value={},
Contributor


why empty dict here?

Contributor Author


(Related comment: #341 (comment))

So by passing id2value as an empty dict, we can trigger loading of the resolvers. But because that isn't good code style, I have now modified the code to have two separate methods, update_resource and load_resource, the latter taking care of only loading resolvers for FieldResources. So the line 877 you pointed to no longer exists. The new split is sketched below.

@mhuang2173 @vijay120
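
The resulting split looks roughly like this (schematic; the helper names are placeholders):

```python
class FieldResource:
    def update_resource(self, id2value, **kwargs):
        # Ingest new/changed KB documents and (re)fit resolvers.
        self._ingest(id2value)
        self._fit_resolvers(**kwargs)

    def load_resource(self, **kwargs):
        # Only restore previously dumped resolvers; no ingestion, so no
        # need to pass an empty id2value dict just to trigger loading.
        self._load_resolvers(**kwargs)
```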

if os.path.exists(resolver_config_path):
with open(resolver_config_path, "rb") as fp:
self.resolver_configurations = pickle.load(fp)
fp.close()

@murali1996 murali1996 merged commit 48fa926 into cisco:master Oct 29, 2021