dirty_cat

dirty_cat has migrated to skrub. This repository will no longer be maintained.

Use skrub instead: it has all the features of dirty_cat and more.

Do not use dirty_cat, but rather the skrub package

dirty_cat was a Python library to facilitate machine learning on dirty categorical variables.

Its functionality has been merged into skrub.

Dirty categories

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

What can dirty_cat do?

dirty_cat provides tools (TableVectorizer, fuzzy_join...) and encoders (GapEncoder, MinHashEncoder...) for handling morphological similarities between categories, for which we usually identify three common cases: similarities, typos and variations.
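To give a rough idea of what "morphological similarity" means here, the sketch below compares strings by the character 3-grams they share. It uses only the Python standard library. The helper names `char_ngrams` and `ngram_similarity` are hypothetical and are not part of dirty_cat or skrub, whose encoders are more elaborate than this.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Two empty strings are treated as identical (an arbitrary convention).
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Compare a reference category with a typo, a variation, and an unrelated entry.
reference = "Police Officer"
for candidate in ["Police Oficer", "police officer II", "Firefighter"]:
    print(f"{candidate!r}: {ngram_similarity(reference, candidate):.2f}")
```

With a measure of this kind, a typo or a variation of a category keeps most of its n-grams and therefore stays close to the original entry, while an unrelated category shares very few.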

The first example notebook goes in depth on how to identify and deal with dirty data using the dirty_cat library.

What dirty_cat does not do

Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.

This kind of problem is tackled by Natural Language Processing methods.

dirty_cat can still help with handling typos and variations in this kind of setting.
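The distinction can be seen with any purely string-based measure. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; this is an assumption made for illustration and not the measure dirty_cat or skrub uses. The helper name `string_ratio` is hypothetical.

```python
from difflib import SequenceMatcher


def string_ratio(a, b):
    """Character-level similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Same meaning, different spelling: a string-based measure sees little in common.
semantic_pair = string_ratio("car", "automobile")

# Same word with a typo: a string-based measure sees them as nearly identical.
typo_pair = string_ratio("automobile", "automobil")

print(f"car vs automobile:       {semantic_pair:.2f}")
print(f"automobile vs automobil: {typo_pair:.2f}")
```

The first pair scores low even though the words are synonyms, which is why semantic matching is left to Natural Language Processing methods, while the second pair scores high because only the spelling differs.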

Installation

Please do not use dirty-cat anymore. Use skrub instead: it has the same features, replaces dirty-cat, and can be installed via pip:

pip install skrub

Dependencies

Dependencies and their minimum versions are listed in the setup file.

Related projects

skrub

Contributing

If you want to encourage development of this functionality, the best thing to do is to spread the word about skrub.

And please contribute to skrub.

Additional resources

References

[1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.

[2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.
