
Add dump functionality to Entity Resolvers and NativeQuestionAnswerer #341

Merged
merged 28 commits into cisco:master
Oct 29, 2021

Conversation

murali1996
Contributor

@murali1996 murali1996 commented Jul 19, 2021

Update:

  • Based on some discussion, calling .fit() in the nlp load process seemed unintuitive, so the necessary modifications were made in entity_resolvers.py and nlp.py to address this. See the related comments here.
  • Pending discussion: should the KB be cached by NativeQA, or should the data be loaded every time the QA is instantiated?
  • Using .dump() and .load() in the NLP pipeline leads to the creation of new files (.config.pkl and .pkl.hash) for every resolver created in WxA. Pending discussion on this as well.


This PR adds new functionality to entity resolvers (and thereby the question answerers built on top of them) to dump their state and load back from the dumped state. In addition, a few disk-space optimizations ensure that the KB data is not cached by both the QA and the underlying resolvers. Lastly, minor bugs in the entity resolvers were fixed, and more comments and type hints were added.

Once approved, details will be added to the Mindmeld docs so users can take advantage of the resolvers' dump and load methods.

Problem:

Before this PR, anyone wishing to use a NativeQuestionAnswerer (i.e. the non-Elasticsearch QA) always had to call the .load_kb() method and fit the underlying resolver models before using the QA for search/inference. This is unlike ElasticsearchQA, which needs load_kb called only once, because ElasticsearchQA loads back its indices when the .get() and .build_search() methods are called directly. In cases where neither the data nor the input configurations have changed from a previous run, the NativeQuestionAnswerer of course need not re-fit the resolvers, provided there is a way to dump the resolvers' states in that previous run. A minimal usage sketch follows.
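
For context, here is a minimal sketch of the usage pattern in question (app path, index name, and data file are made up; exact constructor and signatures may differ between Mindmeld versions):

```python
from mindmeld.components import QuestionAnswerer

qa = QuestionAnswerer(app_path="my_app")

# Native QA before this PR: this had to run in every new process before
# any query, re-fitting the underlying resolvers each time.
qa.load_kb("my_app", "stores", "my_app/data/stores.json")

# Elasticsearch QA, by contrast, needs load_kb only once; later calls to
# .get() / .build_search() load back the already-built indices.
results = qa.get(index="stores", store_name="Pine and Market")
```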

Solution:

  • In this PR, dump and load functionality are first added to the entity resolvers. This required a few changes to the resolver paths (path.py) and the ability for an embedder model in embedder-based resolvers to dump its embeddings cache to a path derived from app_path (related changes in embedder_models.py).
  • In order to keep the NLP pipeline unaltered, a minor change has been made to nlp.py. Previously, the load methods of the resolvers simply redirected to the fit method, so the load() calls to the entity resolver in nlp.py have been replaced with fit() calls.
  • Note that the NLP pipeline doesn't use the dump functionality of the resolvers, as it would create more data files on disk. This is implemented keeping both WxA and end users in mind.
  • Because the QA uses the resolvers' dump functionality, some changes were made in question_answerer.py to encapsulate that and resolve the original problem of this PR. A toy illustration of the resulting fit/dump/load lifecycle follows this list.
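
To make that lifecycle concrete, here is a small self-contained toy of the fit/dump/load pattern; this is not the Mindmeld implementation, just its shape, with sidecar files mirroring the .config.pkl and .pkl.hash files mentioned in the update above:

```python
import hashlib
import pickle
from pathlib import Path

class ToyResolver:
    """Toy stand-in for an entity resolver with dump/load support."""

    def __init__(self, entity_map, config):
        self.entity_map = entity_map  # processed KB data
        self.config = config
        self.model = None

    def fit(self):
        # "Training" here is just indexing synonyms -> canonical names.
        self.model = {syn.lower(): cname
                      for cname, syns in self.entity_map.items()
                      for syn in syns}

    def dump(self, path):
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(self.model))
        # Sidecar files analogous to .config.pkl and .pkl.hash:
        Path(f"{path}.config.pkl").write_bytes(pickle.dumps(self.config))
        digest = hashlib.sha1(pickle.dumps(self.entity_map)).hexdigest()
        Path(f"{path}.hash").write_text(digest)

    def load(self, path):
        # Restore state without re-fitting.
        self.model = pickle.loads(Path(path).read_bytes())

resolver = ToyResolver({"Pine & Market": ["pine and market"]}, {"type": "exact"})
resolver.fit()
resolver.dump("/tmp/resolvers/store_name.pkl")
resolver.load("/tmp/resolvers/store_name.pkl")
```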

Discussions:

Backwards compatibility of the NLP pipeline

  • The NLP pipeline is mostly left unaltered by the introduction of the resolvers' dump and load methods. The only change is that instead of calling self.entity_resolver.load(), we call self.entity_resolver.fit(clean=False) in the load method of EntityProcessor.
  • No backward inconsistencies arise for the ExactMatch, TFIDF, and EmbedderCosSim/SentenceBert resolvers due to this change, as their .load methods previously just called .fit anyway.
  • However, for the Elasticsearch resolver, and only when the data or resolver configurations have changed, there is a backward inconsistency (for good reasons!). Previously, when the resolver's load method was called, it checked whether the index existed and fit the resolver with the latest data and configurations only if it did not. Now the resolver is fit with the latest data and configurations regardless of the index's existence. This slightly increases the loading time of the resolver, since synonyms are re-ingested whenever ElasticsearchResolver.load() is called; see the sketch after this list.
  • The only corner case is when a user creates a resolver index with specific configurations and then runs the nlp pipeline with a different resolver configuration, which is very unlikely. And when the user changes the data, it is always better to ingest the newer data into the index.
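
Schematically, the Elasticsearch resolver's load behavior changed roughly like this (toy code, not the actual implementation):

```python
class ElasticsearchResolverSketch:
    def load_before_this_pr(self):
        # Fit only when the index was missing; an existing index was
        # reused even if data or configs had changed since it was built.
        if not self.index_exists():
            self.fit()

    def load_after_this_pr(self):
        # Always re-fit, so the index always reflects the latest data and
        # configs, at the cost of re-ingesting synonyms on every load().
        self.fit()
```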

@murali1996 murali1996 requested review from vembar and vijay120 July 19, 2021 19:08
vembar
vembar previously approved these changes Jul 20, 2021
@murali1996 murali1996 marked this pull request as draft July 26, 2021 18:02
@murali1996 murali1996 self-assigned this Aug 4, 2021
… for dumping and loading; few bug fixes and some log info modifications; separated Factory class from QuestionAnswerer class
@murali1996 murali1996 changed the title from "Address Issue #340 (TypeError: can't pickle sqlite3.Connection objects)" to "Feature/Dumping feature for QuestionAnswerers" Aug 5, 2021
@murali1996 murali1996 requested a review from vembar August 5, 2021 10:27
Contributor

@vijay120 vijay120 left a comment


A few comments. I am still a bit unclear on why people are trying to dump QA models when they can just load_kb it. Is it because load_kb takes a long time?

@murali1996 murali1996 changed the title from "Feature/Dumping feature for QuestionAnswerers" to "Few fixes related to QuestionAnswerers" Aug 21, 2021
@murali1996
Copy link
Contributor Author

A few comments. I am still a bit unclear on why people are trying to dump QA models when they can just load_kb it. Is it because load_kb takes a long time?

Based on our offline discussion regarding this PR and some analysis of the runtimes, we decided to add the following changes to the QA module as part of this PR, to bring native QA usability on par with Elasticsearch:

  • Dump models built as part of native QA's .load_kb() method to the path ~/.cache/mindmeld, so that users don't have to call .load_kb() in every environment before querying indices.
  • Implement a new method _load_field_info() as part of native QA, mirroring its functionality in Elasticsearch QA, to load the previously dumped models when .get() and .build_search() are called directly, without a prior .load_kb() call in the user's environment.
  • Add notes in the docs informing users that, to replicate the state of the QA module when using native QA (especially in a deployment environment), they'll have to copy the model dump files in ~/.cache/mindmeld. The intended flow is sketched below.
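
Continuing the earlier sketch, the intended build-vs-deployment split looks roughly like this (cache location per the notes above; everything else schematic):

```python
# Build environment: load_kb() fits the resolvers and dumps them
# under ~/.cache/mindmeld.
qa.load_kb("my_app", "stores", "my_app/data/stores.json")

# Deployment environment: after copying ~/.cache/mindmeld across,
# .get() and .build_search() restore the dumped models internally via
# _load_field_info(), so no load_kb() call is needed here.
results = qa.get(index="stores", store_name="Pine and Market")
```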

incremental_model_path if incremental_timestamp else model_path
)
self.entity_resolver.load()
self.entity_resolver.fit(clean=False)
Contributor


Just wondering, do we need this for backward-compatibility reasons? In general we would prefer not to do fit() when loading the models, so that behavior across the NLP pipeline is more consistent and predictable (users won't be surprised by any kind of model building happening during load()). I think the changes in this PR help us move in that direction, since we don't need to do fit() in the resolvers' implementation of load() anymore, and we should keep it that way if we can?

Contributor Author


Yeah, I absolutely agree with your point, Marvin. I did this with WxA in mind; I realized that if we dump and load resolver models (exact match in WxA's case), we will use more disk space, since we'd need to dump the entity map (i.e. the processed KB data) in the dump() call and load it back in the load() call of the NLP pipeline. So I would like to confirm whether the extra disk space is a problem in the case of WxA before making the change you suggested. @vijay120 @vembar @mhuang2173

Contributor Author

@murali1996 murali1996 Sep 29, 2021


Update: It is more logical not to dump a copy of the KB data when we call dump() for a resolver. In the latest push, I added an entity_map argument to the load() method to address this. Now the KB data is loaded in the load() method, similar to the fit() method, so I can replace self.entity_resolver.fit(clean=False) with self.entity_resolver.load(). Also, in the case of question answerers, we can pass the KB data already looked up by the question answerers directly to the load() method.
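
Roughly the new shape (schematic, not the exact signature; _load_entity_map is a placeholder name):

```python
def load(self, path, entity_map=None):
    # A question answerer that already holds the KB data can pass it in;
    # otherwise the resolver loads it itself, just as fit() does, so
    # dump() never needs to persist a copy of the KB.
    self.entity_map = entity_map if entity_map is not None else self._load_entity_map()
    self._load(path)
```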

"""

# dump underlying resolver model/algorithm/embeddings
self._dump(path)
Contributor


Optional: Maybe we should add a message to the user consistent with other models - "Saving entity resolver: domain=<>, intent=<>, entity=<>"
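
For instance, something like the following sketch (the domain/intent/type attribute names are assumptions about the resolver class):

```python
import logging

logger = logging.getLogger(__name__)

def dump(self, path):
    # Mirror the save messages of the other models:
    logger.info("Saving entity resolver: domain=%s, intent=%s, entity=%s",
                self.domain, self.intent, self.type)
    self._dump(path)
```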

Contributor

@mhuang2173 mhuang2173 left a comment


looking good, a few more comments

incremental_model_path if incremental_timestamp else model_path
)
self.entity_resolver.load()
except ElasticsearchConnectionError:
Contributor


I think we might need to replace this one with a more general exception class?

Contributor Author

@murali1996 murali1996 Oct 12, 2021


I added a more generic exception (except EntityResolverError) in the entity_resolver.py module and removed the exception from here. Is there a specific reason we decided to pass on this ElasticsearchConnectionError here without raising an error? (It has been part of our code base for a long time.) @mhuang2173. Also, in the case of resolvers without any training data, this error will not be raised, due to some checks we already have in entity_resolver.py.
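
Schematically, the handling now looks something like this (import path and surrounding code are assumptions; the actual handling lives in entity_resolver.py per the comment above):

```python
from mindmeld.components.entity_resolver import EntityResolverError

try:
    processor.entity_resolver.load()
except EntityResolverError as exc:
    # A single resolver-level error now covers Elasticsearch connection
    # failures as well, rather than silently passing on
    # ElasticsearchConnectionError.
    logger.error("Failed to load entity resolver: %s", exc)
    raise
```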

t["entity"] for t in
self.text_preparation_pipeline.tokenize_and_normalize(text)
]

def dump(self):
Contributor


Maybe we need a _dump() for embedder-model-specific logic?

Also there seems to be some overlap between the generic embedding cache and the Glove-specific cache?

Contributor Author


So the embedder models' dump functionality just dumps and loads a cache object; no configs are dumped. Because all embedder classes hold the cache in the same way, I guess we don't need an embedder-specific dump? @mhuang2173 (A toy illustration follows.)

And for the overlapping Glove dump, I would like to take that up in PR 325 with some more modifications to the Glove embedder class.
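
A toy illustration of why a single generic dump suffices: every embedder persists the same kind of text-to-vector cache (this is not the Mindmeld implementation):

```python
import pickle

class EmbedderCache:
    """Generic text -> embedding-vector cache shared by all embedders."""

    def __init__(self, path):
        self.path = path
        self.cache = {}

    def dump(self):
        # Only the cache object is persisted; no configs.
        with open(self.path, "wb") as fp:
            pickle.dump(self.cache, fp)

    def load(self):
        with open(self.path, "rb") as fp:
            self.cache = pickle.load(fp)
```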

_resource_loader = NativeQuestionAnswerer.get_resource_loader()
for field_name, field_resource in index_resources.items():
field_resource.update_resource(
id2value={},
Contributor


why empty dict here?

Contributor Author


(Related comment: #341 (comment))

So by passing id2value as an empty dict, we can trigger loading of the resolvers. But because that isn't good code style, I have now modified the code to have two separate methods, update_resource and load_resource, the latter taking care of only loading resolvers for FieldResources. So the line 877 you pointed to no longer exists. The new split is sketched below.

@mhuang2173 @vijay120
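
The resulting split looks roughly like this (schematic; the helper names are placeholders):

```python
class FieldResource:
    def update_resource(self, id2value, **kwargs):
        # Ingest new/changed KB documents and (re)fit resolvers.
        self._ingest(id2value)
        self._fit_resolvers(**kwargs)

    def load_resource(self, **kwargs):
        # Only restore previously dumped resolvers; no ingestion, so no
        # need to pass an empty id2value dict just to trigger loading.
        self._load_resolvers(**kwargs)
```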

if os.path.exists(resolver_config_path):
with open(resolver_config_path, "rb") as fp:
self.resolver_configurations = pickle.load(fp)
fp.close()

@murali1996 murali1996 merged commit 48fa926 into cisco:master Oct 29, 2021