The goal of building a machine learning algorithm is to predict accurately in the real world. Data scientists therefore use a process that approximates how the algorithm will perform in real-world circumstances. This process is called data partitioning: it allows the model to be tested on data that it has not "seen" before.
Why is data partitioning important?
Datasets used to train machine-learning algorithms often don't fully represent the real circumstances in which those algorithms will be making predictions.
There are many reasons why a dataset that an analyst is working with may not reflect the real world:
How the data was collected. Data could have been collected in a non-representative manner.
When the data was collected. The collection period might be non-representative because important changes can happen between the time of data collection and model deployment.
Randomness in the system. Degrees of randomness could exist in the system that make data collected at one point in time non-representative of the system at a later point.
The quality of the data could be inherently biased due to histories of discrimination. Racist systems result in racist datasets. Discriminatory systems will result in discriminatory datasets.**
The output variable could be a representation of a discriminatory bias, conscious or unconscious. For example, an algorithm to determine a "good employee" might be using whether or not a person was promoted in their first year as the output variable. This variable cannot be separated from the social power dynamics of racism, sexism, heterosexism, cissexism, etc. in the workplace.**
** It is important to note that data partitioning will not solve this issue; it is listed here because it is a reason why a dataset may not reflect the real world.
How is data partitioning done?
Data partitioning is a process that measures how well an algorithm trained on one dataset will perform on another. The entire dataset is split, or partitioned, into sections — a "training set", "validation set", and "test set."
Analysts will train the model on one part of the dataset and will test on the remaining portion. Data partitioning is used as a proxy to measure the model's performance on data that it hasn't seen before. The validation set is typically used during the model candidate selection and hyperparameter tuning process. The test set is used to evaluate a model at the very end of the process.
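The three-way split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `partition`, the 60/20/20 proportions, and the fixed random seed are hypothetical choices made for the example, not values the article prescribes.

```python
import random


def partition(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation sets becomes the test set.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")

    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


rows = list(range(100))  # stand-in for 100 observations
train, val, test = partition(rows)
print(len(train), len(val), len(test))
```

Shuffling before slicing matters when the rows arrive in a meaningful order (for example, sorted by date or by outcome); without it, the three sets could differ systematically from one another.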
When partitioning, the analyst has to decide how much of the data to allocate to the training set on one hand and to the validation and test sets on the other. Should there be a 75/25 split? 50/50? 90/10?
If more data is given to the training set, an algorithm stands a better chance of learning useful correlations because it has a greater number and variety of subjects from which to learn. However, if the test set is too small, the analyst won't have a good idea of how well the algorithm's performance generalizes to data in the real world.
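One way to see why a small test set gives a poor picture of generalization is to look at the uncertainty of a measured accuracy. The sketch below uses the standard binomial standard error, sqrt(p(1 - p) / n), and assumes that test observations are independent; the dataset size of 1,000 and the accuracy of 0.8 are made-up numbers used only for illustration.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test independent rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


n_total = 1000           # hypothetical dataset size
assumed_accuracy = 0.8   # hypothetical true accuracy of the model

for train_frac in (0.5, 0.75, 0.9):
    n_test = n_total - round(n_total * train_frac)
    se = accuracy_standard_error(assumed_accuracy, n_test)
    # 1.96 standard errors gives an approximate 95% margin of error
    print(f"{train_frac:.0%} training: {n_test} test rows, "
          f"margin of error about +/- {1.96 * se:.3f}")
```

The margin of error grows as the test set shrinks, which is the trade-off behind the choice between splits such as 50/50 and 90/10.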
This decision is guided mainly by these factors:
The size of the entire starting dataset. A larger dataset tends to weigh in favor of a more even split between training and test sets. If there are hundreds of thousands of observations, allocating half to the test set is unlikely (in most contexts) to deprive the training set of examples from which the algorithm could learn. If the dataset is considerably smaller, an analyst might proceed by assigning more observations to the training set.
The summary statistics of the outcome variable. Summary statistics could identify cases that are extremely rare yet crucial to the correctness of the model. An analyst might want to ensure that the training set includes examples of this rare outcome by allocating more of the data to the training set.
Subject-matter knowledge about the domain in which an algorithm is applied. If the environment is relatively stable, an analyst may allocate more data to the training set. If the environment is changing rapidly, it might make more sense to allocate more data to the test set, so that the variation in the test data reflects some of that dynamism when the model is evaluated.
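For the rare-outcome concern in the list above, a common complementary technique (not one the article itself describes) is a stratified split: the data is split separately within each outcome class, so the rare class appears in both sets in roughly the same proportion. The sketch below is a minimal version; the function name `stratified_split`, the 75/25 proportion, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_frac=0.75, seed=0):
    """Split rows into train and test sets, preserving each label's share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)                       # shuffle within each class
        cut = round(len(group) * train_frac)     # per-class training count
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy data: 8 "rare" outcomes among 200 observations (made-up numbers).
rows = [(i, "rare" if i < 8 else "common") for i in range(200)]
train, test = stratified_split(rows, label_of=lambda row: row[1])

rare_in_train = sum(1 for _, label in train if label == "rare")
rare_in_test = sum(1 for _, label in test if label == "rare")
print(rare_in_train, rare_in_test)
```

With a purely random split, all eight rare cases could land in one set by chance; stratifying removes that risk without changing the overall split ratio.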
Developing a well-trained algorithm is the primary goal in any supervised machine-learning context. Methods like cross-validation exist to get around the issues raised in this article. Briefly, cross-validation is a method by which the model is trained and tested multiple times. Each time, the testing and training sets are allocated differently. After several rounds, the performance scores are averaged.
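The cross-validation procedure just described can be sketched as follows. This is a minimal k-fold version under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`, `fit_majority`, `score_accuracy`) are hypothetical, and the "model" is a deliberately trivial one that always predicts the most common label in its training data, chosen only so the example runs without any external library.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, k, fit, score, seed=0):
    """Train and test k times, holding out a different fold each time."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [row for i, row in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        model = fit(train)                 # train on everything except the fold
        scores.append(score(model, test))  # evaluate on the held-out fold
    return mean(scores), scores


def fit_majority(train):
    """Toy 'model': remember the most common label in the training rows."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)


def score_accuracy(model, test):
    """Share of held-out rows whose label matches the toy model's prediction."""
    return mean(1.0 if label == model else 0.0 for _, label in test)


rows = [(i, i % 4 == 0) for i in range(40)]  # toy data: 25% of labels are True
average, per_fold = cross_validate(rows, k=5, fit=fit_majority, score=score_accuracy)
print(average, per_fold)
```

Because every observation is held out exactly once, the averaged score uses all of the data for evaluation while each individual model is still tested only on rows it did not see during training.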
After the data is partitioned, the analyst puts the partition to use by selecting model candidates, training them, and evaluating their performance.