Fine-Tuning Qwen 3:0.6B for Question Categorization

Overview

Fine-tuning a tiny local LLM can turn it from an unreliable classifier into a dependable component for metadata-aware RAG. By fine-tuning Qwen 3:0.6B with the Unsloth framework and QLoRA, a developer raised categorization accuracy on household-related questions from a baseline of about 10% to 92%, measured on a suite of 131 tests.

The Role of Categorization in RAG

Question categorization serves as a pre-processing step to improve the precision of Retrieval-Augmented Generation (RAG). By mapping a user query to a specific metadata category (e.g., "pool", "hvac", "cooking"), the system can narrow the search space for vector ranking to only indexed entries that match that category. This prevents the model from retrieving irrelevant documents from other categories, thereby increasing the overall accuracy of the final answer.
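
The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the developer's implementation: the entry layout (`id`, `category`, `vector`), the `retrieve` function, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, category, entries, top_k=3):
    """Keep only entries tagged with the predicted category, then rank them."""
    candidates = [e for e in entries if e["category"] == category]
    ranked = sorted(
        candidates,
        key=lambda e: cosine(query_vec, e["vector"]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical index entries; real vectors would come from an embedding model.
entries = [
    {"id": "doc-1", "category": "pool", "vector": [0.9, 0.1]},
    {"id": "doc-2", "category": "hvac", "vector": [0.95, 0.05]},
    {"id": "doc-3", "category": "pool", "vector": [0.2, 0.8]},
]

results = retrieve([1.0, 0.0], "pool", entries)
print([e["id"] for e in results])
```

Here "doc-2" is close to the query in vector space but is never considered, because its category does not match the predicted one.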

Baseline Performance: Prompting Alone

Using the original Qwen 3:0.6B model without fine-tuning, the developer established a baseline using a strict prompt requiring the model to return only a category name from a provided list.

Baseline Results:

  • Accuracy: ~10% (13 correct out of 131 tests).
  • Failure Patterns: The model frequently overused broad labels like "electric" or "appliances" and often invented new categories (e.g., "apartments") that were not in the allowed list.
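
A strict prompt of this kind, paired with a check that rejects anything outside the allowed list, might look like the sketch below. The prompt wording, the `build_prompt` and `parse_category` helpers, and the shortened category list are assumptions for illustration; the source does not give the actual prompt.

```python
# Illustrative subset of categories mentioned in the article.
ALLOWED = ["pool", "hvac", "cooking", "electric", "appliances", "water heater"]

def build_prompt(question, allowed):
    """Build a prompt that asks for exactly one label from the allowed list."""
    options = ", ".join(allowed)
    return (
        "Classify the question into exactly one category.\n"
        f"Allowed categories: {options}\n"
        "Reply with the category name only.\n"
        f"Question: {question}"
    )

def parse_category(raw_output, allowed):
    """Return the label if the reply is exactly an allowed category, else None."""
    label = raw_output.strip().lower()
    return label if label in allowed else None

prompt = build_prompt("How often should I clean the filter?", ALLOWED)
print(prompt)
print(parse_category(" HVAC\n", ALLOWED))
print(parse_category("apartments", ALLOWED))
```

Treating an invented label such as "apartments" as a failure, rather than mapping it to the nearest allowed category, keeps the accuracy figure honest.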

Fine-Tuning Strategy and Implementation

To move beyond the baseline, the developer utilized the Unsloth open-source framework with QLoRA for fine-tuning.

Dataset and Training

  • Dataset Size: Approximately 850 data entries.
  • Data Split: 70% training, 15% evaluation, and 15% test data.
  • Evaluation: A battery of 131 integration tests was used to measure performance post-training.
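
The 70/15/15 split can be reproduced with a seeded shuffle. The sketch below is one way to do it, assuming a dataset of exactly 850 rows; the `split_dataset` helper and the seed are hypothetical, and the source does not say how the developer performed the split.

```python
import random

def split_dataset(rows, train_pct=70, eval_pct=15, seed=42):
    """Shuffle deterministically, then cut into train / eval / test slices."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * train_pct // 100
    n_eval = len(shuffled) * eval_pct // 100
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]  # the remainder goes to the test set
    return train, evaluation, test

rows = list(range(850))  # stand-in for the labelled question/category pairs
train_rows, eval_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(eval_rows), len(test_rows))
```

With 850 rows this yields 595 training, 127 evaluation, and 128 test entries.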

First Fine-Tuning Attempt

In the first attempt, the model was trained to output the category names directly.

  • Accuracy: 79% (104 correct out of 131 tests).
  • Remaining Issues: The model occasionally emitted fragments of categories (e.g., "ac" instead of "hvac") and struggled with semantically overlapping categories (e.g., "water heater" vs. "pool").

Optimizing Accuracy with Opaque IDs

Mapping categories to opaque, two-letter codes (e.g., "AA" for appliances, "KK" for hvac) significantly improved performance over using semantic category names. With the semantic overlap removed from the output tokens, the tiny model was better able to map queries to a fixed, non-overlapping format.

Final Results:

  • Accuracy: 92% (120 correct out of 131 tests).
  • Key Finding: Asking for a fixed, non-overlapping output format helps small models maintain consistency and prevents them from hallucinating synonyms or fragments.
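
The opaque-ID scheme amounts to a lookup table applied in both directions: category to code when building training targets, and code back to category when reading the model's reply. In the sketch below, only "AA" and "KK" come from the article; the other codes and the helper names are hypothetical.

```python
CODE_TO_CATEGORY = {
    "AA": "appliances",    # from the article
    "KK": "hvac",          # from the article
    "PP": "pool",          # hypothetical code for illustration
    "WW": "water heater",  # hypothetical code for illustration
}
CATEGORY_TO_CODE = {name: code for code, name in CODE_TO_CATEGORY.items()}

def to_training_target(category):
    """Convert a human-readable label into the code the model is trained to emit."""
    return CATEGORY_TO_CODE[category]

def decode_reply(raw_output):
    """Map the model's reply back to a category; None if the code is unknown."""
    code = raw_output.strip().upper()
    return CODE_TO_CATEGORY.get(code)

print(to_training_target("hvac"))
print(decode_reply(" kk\n"))
print(decode_reply("ac"))
```

Because a fragment such as "ac" is not a valid code, it is rejected outright instead of being mistaken for a near-match.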

Remaining Challenges

Despite the 92% accuracy, some failures persisted, particularly where categories have overlapping meanings. For example, the model continued to misclassify "water heater" queries as "pool" queries due to the shared "watery" context. The author notes that further improvements will require more nuanced training data to better differentiate these specific semantic overlaps.
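
One way to decide where additional training data is most needed is to tally the remaining failures by (expected, predicted) pair. The sketch below assumes test results are available as such pairs; the sample data is invented for illustration and is not the developer's actual test output.

```python
from collections import Counter

def confusion_pairs(results):
    """Count each (expected, predicted) pair among the misclassified tests."""
    return Counter(
        (expected, predicted)
        for expected, predicted in results
        if expected != predicted
    )

# Invented sample of (expected, predicted) outcomes.
results = [
    ("water heater", "pool"),
    ("water heater", "pool"),
    ("hvac", "hvac"),
    ("pool", "pool"),
    ("hvac", "appliances"),
]

pairs = confusion_pairs(results)
for (expected, predicted), count in pairs.most_common():
    print(f"{expected} -> {predicted}: {count}")
```

The most frequent pairs point to the category boundaries where extra contrasting examples are likely to help most.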
