At Tune.ai, we support several dataset formats for fine-tuning. We recommend JSONL (JSON Lines) for its flexibility and ease of use, but you can also use HuggingFace datasets with designated columns to structure your data.

JSONL Schema:

The schema of the JSONL depends on the task and the prompt template you wish to use. Below are examples of different dataset formats:

  1. Completion:

    [{"text": "..."}, {"text": "..."}...]
    
  2. Instruct:

    [{"instruction": "...", "input": "...", "output": "..."},
    {"instruction": "...", "input": "...", "output": "..."}...]
    // "instruction" is the task the LLM needs to perform
    // "input" is the input to that task
    // "output" is the expected generation
    
  3. Chat:

    [
        {"conversations": [{"from": "system", "value": "..."}]},
        {"conversations": [{"from": "human", "value": "..."}]},
        {"conversations": [{"from": "gpt", "value": "..."}]},
        {"conversations": [{"from": "human", "value": "..."}]},
        {"conversations": [{"from": "gpt", "value": "..."}]},
        ...
    ]
    // "from" must be one of "system", "human", or "gpt"
    // At most one "system" message is allowed, and only at the start of the conversation
    // After that, "human" and "gpt" messages alternate, starting with "human"
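
The chat rules above can be checked before uploading a dataset. The following is a minimal sketch, not part of the Tune.ai platform: `validate_conversation` is a hypothetical helper, and it assumes a conversation is available as a list of message objects with "from" and "value" keys.

```python
ALLOWED_ROLES = {"system", "human", "gpt"}


def validate_conversation(messages):
    """Return a list of problems; an empty list means the rules are satisfied.

    `messages` is assumed to be a list of {"from": ..., "value": ...} dicts.
    """
    problems = []

    # Rule 1: every role must be known, and "system" may only come first.
    for i, msg in enumerate(messages):
        role = msg.get("from")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        elif role == "system" and i != 0:
            problems.append(f"message {i}: 'system' is only allowed first")

    # Rule 2: after an optional leading "system" message,
    # roles alternate human, gpt, human, gpt, ...
    has_system = bool(messages) and messages[0].get("from") == "system"
    offset = 1 if has_system else 0
    for j, msg in enumerate(messages[offset:]):
        expected = "human" if j % 2 == 0 else "gpt"
        role = msg.get("from")
        if role in ("human", "gpt") and role != expected:
            problems.append(
                f"message {j + offset}: expected {expected!r}, got {role!r}"
            )

    return problems
```

Calling `validate_conversation` on each conversation and printing any non-empty result is a cheap way to catch ordering mistakes before a fine-tuning job is started.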
    

Our platform has built-in functionality to convert chat history threads into a chat-format dataset, so you can use existing conversation threads, such as those from messaging platforms or forums, without manual conversion. See ThreadsAPI.

Crafting Your Dataset:

Creating a high-quality dataset is crucial for effective fine-tuning. Here are some tips to consider:

  • Include Diverse Examples: Ensure your dataset covers various scenarios, including cases where the model’s behavior needs improvement.
  • Provide Ideal Responses: Assistant messages in the dataset should represent the desired responses you want the model to generate.
  • Repeat Instructions: If certain instructions or prompts worked well before fine-tuning, consider including them in every training example for the best and most general results.
  • Consider Cost and Length: Shorten repeated instructions or prompts if necessary to save costs, but remember that the model may still behave as if they were included.
  • More Training Examples: The model may need a larger number of training examples to learn effectively through demonstration alone.
  • Split into Training and Test Sets: Divide your dataset into training and test portions. This lets you monitor improvement during training and evaluate the model on unseen examples afterwards.
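
The last tip can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not a Tune.ai API: `split_dataset` and `write_jsonl` are hypothetical helper names, the 80/20 ratio is an arbitrary choice, and the output follows the usual JSONL convention of one JSON object per line.

```python
import json
import random


def split_dataset(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of `records` and return (train, test) lists."""
    shuffled = list(records)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, `train, test = split_dataset(records)` followed by `write_jsonl(train, "train.jsonl")` and `write_jsonl(test, "test.jsonl")` produces two files that can be kept separate for training and evaluation. The file names here are placeholders.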

By following these guidelines, you can create a robust dataset tailored to your fine-tuning needs, setting the stage for successful model optimization and deployment.