dirty_cat has migrated to skrub. This repository will no longer be maintained.
Use skrub instead: it has all the features of dirty-cat and more.
dirty_cat was a Python library to facilitate machine-learning on dirty categorical variables.
Its functionality has been merged into skrub.
For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].
dirty_cat provides tools (TableVectorizer, fuzzy_join...) and encoders (GapEncoder, MinHashEncoder...) for morphological similarities, for which we usually identify three common cases: similarities, typos and variations.
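To make "morphological similarity" concrete, here is a minimal sketch that compares strings by the character 3-grams they share. It is an illustration only, not the implementation used by dirty_cat or skrub; the helper names `char_ngrams` and `ngram_similarity` are hypothetical and the city names are made-up sample values.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings too short to produce any n-gram are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A typo keeps part of the n-grams, an unrelated value shares none.
typo_score = ngram_similarity("London", "Londno")
unrelated_score = ngram_similarity("London", "Paris")
print(typo_score, unrelated_score)
```

A misspelled entry still scores well above an unrelated one, which is the kind of signal that string-similarity-based encoders can exploit instead of treating every distinct spelling as a separate category.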
The first example notebook goes in-depth on how to identify and deal with dirty data using the dirty_cat library.
Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.
This kind of problem is tackled by Natural Language Processing methods.
dirty_cat can still help with handling typos and variations in this kind of setting.
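The limitation can be seen with any purely string-based measure. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for such a measure (it is not what dirty_cat or skrub use internally); the helper name `string_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Character-level similarity ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


# Synonyms with different spellings look unrelated to a string-based measure.
semantic_pair = string_similarity("car", "automobile")

# A spelling variation of the same word remains very close.
typo_pair = string_similarity("automobile", "automobil")

print(semantic_pair, typo_pair)
```

The synonym pair scores low because the two words share almost no characters in sequence, whereas the misspelled variant stays close to the original. Capturing the first kind of relation requires a model of meaning, which is why it falls to Natural Language Processing methods.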
Please use skrub rather than dirty-cat: it replaces dirty-cat, has the same features, and can be easily installed via pip:
pip install skrub
Dependencies and minimal versions are listed in the setup file.
If you want to encourage development of this functionality, the best thing to do is to spread the word about skrub.
And please contribute to skrub.
[1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.
[2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.