Training Data
What is training data?
Training data is the dataset used to train a machine learning model or large language model, teaching it patterns and relationships that it will use to make predictions or generate responses. In the context of LLMs like ChatGPT, training data consists of vast amounts of text from the internet and other sources that the model learns from during its initial training phase.
Training data determines what a model "knows" at its foundation—the baseline knowledge baked into the model's weights during training. This is distinct from information dynamically retrieved via RAG (Retrieval Augmented Generation) at query time.
How is training data different from other types of data?
AI development involves three distinct types of data:
Training data teaches the model patterns during the initial training phase. For LLMs, this typically includes massive internet corpora. For specialized machine learning models, it's more targeted data relevant to the specific prediction task.
Test data is held out from training and used to evaluate whether the model generalizes well to unseen examples. This separation is critical for validating that the model hasn't simply memorized training examples.
Production data is what the model encounters during actual use—real user inputs and interactions after deployment.
Preventing leakage between these datasets is essential for building reliable models.
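The separation between training and test data described above is usually enforced by partitioning a dataset before any model sees it. This is a minimal sketch of a hold-out split using only the standard library; the function name, fractions, and toy data are illustrative, not part of any specific framework:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle and partition examples into disjoint train and test sets.

    Shuffling first removes any ordering in the source data; fixing the
    seed makes the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train, test = train_test_split(data)
assert set(train).isdisjoint(test)  # no example appears in both sets
```

Production data, by contrast, is never split off in advance; it simply arrives after deployment, which is why the train/test partition is the only place this discipline can be applied.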
What is data leakage in training data?
Data leakage occurs when information that would not be available to the model at prediction time accidentally gets included in the training data. A common example: when building a forecasting model, accidentally including future outcomes in the training data yields unrealistically good performance during development, but the model fails in production.
For LLMs, data leakage can happen when test data contaminates training data, making it impossible to accurately evaluate model performance. Careful data cleaning and curation prevent these issues and ensure models generalize well to unseen inputs.
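One concrete form of the cleaning step mentioned above is a contamination check: removing test examples that also appear in the training corpus. The sketch below uses exact-match deduplication for clarity; real pipelines typically use n-gram overlap or fuzzy matching at scale, and the function name and sample texts here are illustrative assumptions:

```python
def remove_contaminated(train_corpus, test_examples):
    """Drop test examples that also appear verbatim in the training corpus.

    A contaminated test set overstates model quality, because the model may
    have memorized those examples rather than generalized to them.
    """
    seen = {t.strip().lower() for t in train_corpus}
    return [t for t in test_examples if t.strip().lower() not in seen]

train_corpus = ["the cat sat on the mat", "dogs bark at strangers"]
test_examples = ["The cat sat on the mat", "birds migrate in winter"]

# Only the genuinely unseen example survives the check.
clean_test = remove_contaminated(train_corpus, test_examples)
```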
Learn more:
- Don't Use Generative AI to Replace Discovery with Real Humans
- Turning Disruption into Opportunity: The Stack Overflow AI Story with Ellen Brandenberger
- Debugging AI Products: From Data Leakage to Evals with Hamel Husain
Related terms: