Fine-Tuning

Fine-tuning adapts a pretrained model to specific tasks: it teaches the LLM to pick up new patterns from the training data, which is why the data itself is so important in fine-tuning.

Types of Datasets

Supervised Fine-Tuning (SFT) Datasets

Preference Alignment Datasets

Challenges in Creating Data

Dataset Formats & Popular Datasets

Stanford Alpaca Dataset

alpaca_data.json contains the 52K instruction-following examples used to fine-tune the Alpaca model. The JSON file is a list of dictionaries; each dictionary contains the following fields:

  • instruction: describes the task the model should perform.
  • input: optional context or input for the task (may be empty).
  • output: the answer to the instruction, as generated by text-davinci-003.
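
A minimal sketch of loading and inspecting the file with Python, assuming alpaca_data.json has already been downloaded from the Stanford Alpaca repository:

```python
import json

# Load the Alpaca instruction data (a list of dicts) and inspect one record.
with open("alpaca_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(len(data))                      # ~52,000 samples
print(json.dumps(data[0], indent=2))  # keys: "instruction", "input", "output"
```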

Examples of Open SFT Datasets

What is a good SFT dataset?

  1. Accuracy
    • Factually correct outputs
    • Minimal typos
    • Preserve model knowledge integrity
  2. Diversity
    • Cover a wide range of topics (use-case dependent)
    • Include various writing styles
  3. Complexity
    • Include complex tasks that force reasoning
    • Examples: chain-of-thought reasoning, summarization, “explain like I’m 5”

Creating SFT Datasets: A Recipe

  1. Start with open-source datasets (combine multiple datasets)
  2. Data deduplication and decontamination: This step is crucial to ensure the dataset doesn’t contain redundant information and isn’t contaminated with samples that might appear in the test set (see the deduplication sketch after this list).
    • Exact deduplication: Remove identical samples by normalizing the data (e.g., converting text to lowercase), generating a hash for each sample (e.g., MD5 or SHA-256), and removing duplicates.
    • Fuzzy deduplication
      • MinHash: Fuzzy deduplication with hashing, sorting, and Jaccard similarity (preferred technique).
      • Bloom filters: Fuzzy deduplication with hashing and a fixed-size bit vector.
    • Decontamination: Remove samples too close to test sets, using either exact or fuzzy filtering.
  3. Data quality evaluation: This step helps in filtering out low-quality or irrelevant data that could negatively impact model training (a rule-based filtering sketch follows this list).
    • Rule-based: Remove samples based on a list of unwanted words or phrases, such as refusals like “As an AI assistant”.
    • LLM-as-a-judge: Colab notebook that provides code to rate outputs with Mixtral-8x7B.
    • Data Prep Kit: Framework for data preparation for both code and language, with modules in Python, Ray, and Spark, scaling from laptops to data centers.
    • Argilla: Open-source data curation platform that allows you to filter and annotate datasets in a collaborative way.
  4. Data Generation: This step is used to augment your dataset, especially if the initial dataset is small or lacks diversity (a minimal synthetic-generation sketch follows this list).
    • Augmentoolkit: Framework to convert raw text into datasets using open-source and closed-source models.
    • Distilabel: General-purpose framework that can generate and augment data (SFT, DPO) with techniques like UltraFeedback and DEITA.
  5. Data Exploration: This step helps in understanding the composition and characteristics of your dataset using topic clustering and visualization, which can inform further refinement or generation steps (a clustering sketch follows this list).
    • Nomic Atlas: Interact with instruction data to find insights and store embeddings.
  6. Iterate: Use insights to generate more data and repeat the process
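
To make the deduplication step concrete, here is a minimal Python sketch of exact and fuzzy (MinHash) deduplication. It assumes samples are plain strings and uses the datasketch library for MinHash/LSH; the normalization, word-level shingling, and similarity threshold are illustrative choices, not the settings of any particular pipeline.

```python
import hashlib

from datasketch import MinHash, MinHashLSH  # pip install datasketch


def normalize(text: str) -> str:
    """Basic normalization before hashing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_dedup(samples: list[str]) -> list[str]:
    """Exact deduplication: hash each normalized sample and drop repeats."""
    seen, kept = set(), []
    for sample in samples:
        digest = hashlib.sha256(normalize(sample).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from the sample's word-level shingles."""
    signature = MinHash(num_perm=num_perm)
    for token in set(normalize(text).split()):
        signature.update(token.encode("utf-8"))
    return signature


def fuzzy_dedup(samples: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a sample only if no previously kept sample has an estimated
    Jaccard similarity above the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, sample in enumerate(samples):
        signature = minhash_of(sample)
        if lsh.query(signature):  # a near-duplicate is already indexed
            continue
        lsh.insert(f"sample-{i}", signature)
        kept.append(sample)
    return kept


if __name__ == "__main__":
    data = [
        "Explain photosynthesis in simple terms.",
        "explain  photosynthesis in simple terms.",      # exact duplicate after normalization
        "Explain photosynthesis in very simple terms.",  # near-duplicate
        "Write a haiku about the ocean.",
    ]
    print(fuzzy_dedup(exact_dedup(data)))
```

The same exact and fuzzy matching can be pointed at the test set for decontamination: instead of comparing training samples to each other, index the test samples and drop any training sample that matches one of them.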
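
For the rule-based part of the quality-evaluation step, a phrase blocklist is often enough as a first pass. The sketch below assumes samples are dictionaries with an "output" field; the blocked phrases and the minimum-length cutoff are hypothetical examples, not a definitive list.

```python
# Hypothetical sample structure: {"instruction": ..., "input": ..., "output": ...}
UNWANTED_PHRASES = [
    "as an ai assistant",
    "as an ai language model",
    "i cannot help with that",
    "i'm sorry, but i can't",
]


def is_low_quality(sample: dict, min_output_chars: int = 20) -> bool:
    """Flag samples containing refusal boilerplate or trivially short outputs."""
    output = sample.get("output", "").lower()
    if len(output) < min_output_chars:
        return True
    return any(phrase in output for phrase in UNWANTED_PHRASES)


def rule_based_filter(samples: list[dict]) -> list[dict]:
    """Keep only samples that pass the rule-based checks."""
    return [s for s in samples if not is_low_quality(s)]
```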
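
For the data-generation step, a self-instruct-style loop can be sketched with any LLM API; the version below uses the official openai Python client (>= 1.0) as one possible backend. It is not how Augmentoolkit or Distilabel work internally, and the model name and prompt template are placeholders.

```python
# Requires `pip install openai` and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "Here is an example task for an instruction-tuned assistant:\n\n"
    "Instruction: {instruction}\n\n"
    "Write one new, different instruction on a related topic, followed by a "
    "high-quality answer. Use the format:\nInstruction: ...\nOutput: ..."
)


def generate_variation(seed_instruction: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to produce a new instruction/answer pair from a seed sample."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(instruction=seed_instruction)}],
        temperature=0.9,  # higher temperature encourages more diverse generations
    )
    return response.choices[0].message.content


# Example: new_samples = [generate_variation(s["instruction"]) for s in seed_samples]
```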
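
For the data-exploration step, a lightweight stand-in for a tool like Nomic Atlas is to embed the instructions and cluster them; the embedding model and cluster count below are arbitrary choices.

```python
# Requires `pip install sentence-transformers scikit-learn`.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def cluster_instructions(instructions: list[str], n_clusters: int = 8):
    """Embed instructions and group them into rough topic clusters."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(instructions, show_progress_bar=False)
    n_clusters = min(n_clusters, len(instructions))  # avoid more clusters than samples
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(embeddings)
    print("Samples per cluster:", Counter(labels.tolist()))
    return labels
```

Inspecting a few samples from each cluster quickly reveals over-represented topics or gaps, which feeds back into the generation and iteration steps above.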

Best Practices
