ivangermanov/openml-tags

This repo contains my Master's thesis project in Computer Science & Engineering.

You can find the thesis PDF here.

Summary

Developed a novel NLP pipeline to automatically generate tags for texts, improving discoverability and organization on the OpenML platform.

  • Data Preparation: Conducted in-depth exploratory data analysis and data augmentation to improve input quality.
  • Advanced NLP Techniques: Integrated the LLaMA-3-70b LLM for prompt-based tag generation and a DeBERTa-based zero-shot classifier for tag filtering, achieving nuanced and context-aware tagging.
  • Model Optimization: Extended and optimized the BERTopic model with advanced embedding (Salesforce/SFR-Embedding-2_R) and dimensionality reduction (UMAP) techniques, fine-tuned using Bayesian optimization.
  • Automated Evaluation: Achieved a combined NPMI and diversity score of 0.779, outperforming established baselines (LDA, NMF, Top2Vec, CTM).
  • Human Evaluation: Conducted a user study (n=21) and a large-scale automated evaluation (using GPT-4-mini), showing that the pipeline outperforms the baseline and approaches human-level results on multiple metrics.
  • Cost-Effective Pipeline: Developed a computationally cheaper pipeline configuration maintaining acceptable tag quality.
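
The automated evaluation above reports NPMI and topic diversity. As a rough illustration of what these two metrics measure, here is a minimal sketch using their common textbook definitions. It is not the thesis code: the function names, the toy corpus, and the choice to estimate probabilities from document-level co-occurrence are all assumptions, and how the thesis combines the two metrics into a single score is not shown here.

```python
import math
from itertools import combinations


def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of all topics (1.0 = no overlap)."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words) if top_words else 0.0


def npmi(word_a, word_b, documents):
    """Normalized pointwise mutual information of two words, in [-1, 1].

    Assumption: probabilities are estimated as the share of documents
    (each given as a set of words) that contain the word or the pair.
    """
    if not documents:
        raise ValueError("documents must not be empty")
    n_docs = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n_docs
    p_b = sum(word_b in doc for doc in documents) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # both words appear in every document; avoids dividing by log(1) = 0
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence(topic, documents):
    """Mean NPMI over all word pairs of a single topic."""
    pairs = list(combinations(topic, 2))
    if not pairs:
        return 0.0
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)


# Toy corpus (made up for illustration): each document is a set of words.
toy_docs = [{"cat", "dog"}, {"cat", "dog"}, {"car", "road"}, {"car", "road"}]
toy_topics = [["cat", "dog"], ["car", "road"]]

print(topic_diversity(toy_topics))
print([topic_coherence(topic, toy_docs) for topic in toy_topics])
```

Higher NPMI means the top words of a topic tend to appear in the same documents, while higher diversity means the topics share fewer top words.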

Environment Setup

To set up the environment and install the necessary dependencies, please follow these steps.

Prerequisites

Make sure you have Anaconda or Miniconda installed on your system.

Steps

  1. Create a new environment using the environment.yml file:

    conda env create -f environment.yml
  2. Activate the environment:

    conda activate openml-tags
  3. Verify that the packages were installed by listing the contents of the active environment:

    conda list

At this point, all required packages should be installed, and you can start using the repository and running the notebooks in the notebooks directory.
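
As an optional extra check, a short Python script can confirm that key packages are importable from the activated environment. This is only a sketch: the package names in `ASSUMED_PACKAGES` are guesses based on the tools mentioned in the summary, not a list taken from environment.yml, so adjust them to match that file.

```python
import importlib.util

# Hypothetical list of import names; replace it with the packages
# actually declared in environment.yml.
ASSUMED_PACKAGES = ["bertopic", "umap", "transformers"]


def find_missing(packages):
    """Return the top-level import names that cannot be found in the active environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All checked packages are importable.")
```

Run it with the openml-tags environment active. `find_spec` only locates a package without importing it, so the check stays fast even for large libraries.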
