This repo contains my Master's thesis project in Computer Science & Engineering.
You can find the thesis PDF here.
Developed a novel NLP pipeline to automatically generate tags for texts, improving discoverability and organization on the OpenML platform.
- Data Preparation: Conducted in-depth exploratory data analysis and data augmentation to improve input quality.
- Advanced NLP Techniques: Integrated the LLaMA-3-70b LLM for prompt-based tag generation and a DeBERTa-based zero-shot classifier for tag filtering, producing nuanced, context-aware tags.
- Model Optimization: Extended and optimized the BERTopic model with advanced embedding (Salesforce/SFR-Embedding-2_R) and dimensionality reduction (UMAP) techniques, fine-tuned using Bayesian optimization.
- Automated Evaluation: Achieved a combined NPMI and diversity score of 0.779, outperforming established baselines (LDA, NMF, Top2Vec, CTM) in automated evaluations.
- Human Evaluation: Conducted a user study (n=21) and large-scale automated evaluation (using GPT-4-mini), demonstrating superior performance compared to the baseline and approaching human-level results on multiple metrics.
- Cost-Effective Pipeline: Developed a computationally cheaper pipeline configuration maintaining acceptable tag quality.
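The automated evaluation above reports a combined NPMI and diversity score. As a rough illustration of what these two metrics measure, here is a minimal sketch in plain Python. It is not the thesis code: the function names `topic_diversity` and `npmi_coherence` are hypothetical, NPMI is estimated from document-level co-occurrence (one of several common choices), and since this README does not state how the two scores are combined, the sketch returns them separately.

```python
import math
from itertools import combinations


def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics.

    1.0 means no word is shared between topics; lower values mean more overlap.
    """
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


def npmi_coherence(topics, documents):
    """Mean NPMI over word pairs in each topic, averaged over topics.

    Assumptions (illustrative, not taken from the thesis):
    - probabilities are estimated from document-level co-occurrence;
    - every topic contains at least two words;
    - pairs that never co-occur get the limiting value -1.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic, 2):
            p12 = prob(w1, w2)
            if p12 == 0:
                pair_scores.append(-1.0)  # words never appear together
            elif p12 == 1.0:
                pair_scores.append(1.0)  # limiting value; avoids division by zero
            else:
                pmi = math.log(p12 / (prob(w1) * prob(w2)))
                pair_scores.append(pmi / -math.log(p12))
        topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores)


# Toy corpus of tokenized documents and two candidate topics (made-up data).
documents = [
    ["dataset", "image", "classification"],
    ["image", "pixel", "classification"],
    ["gene", "protein", "expression"],
    ["gene", "expression", "cancer"],
]
topics = [["image", "classification"], ["gene", "expression"]]

print("diversity:", topic_diversity(topics))
print("npmi:", npmi_coherence(topics, documents))
```

Higher values are better for both metrics: NPMI ranges from -1 to 1, and diversity from 0 to 1. In practice, established libraries are usually a better choice than a hand-rolled version like this one.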
To set up the environment and install the necessary dependencies, please follow these steps.
Make sure you have Anaconda or Miniconda installed on your system.
- Create a new environment using the environment.yml file:
    conda env create -f environment.yml
- Activate the environment:
    conda activate openml-tags
- Verify the environment is working as expected by running:
    conda list
At this point, all required packages should be installed, and you can start using the repository and running the notebooks in the notebooks directory.