The data has been analyzed and visualized. Now the data scientist has some idea about how they'll be transforming the data. In this article, we'll provide an overview of the various actions taken to transform and reduce a dataset. We'll present many of the tradeoffs that require analysts to make judgment calls based on domain knowledge.
Often, there is much more data than is relevant to the task. For example, suppose you're trying to analyze data from a particular location. In that case, you have to remove all data points about observations from other areas.
Through the data analysis and visualization stage, the analyst might have identified some incorrect data, like a person who is 400 years old or a ten-year-old who has a credit history. The analyst decides whether to remove the entire row representing all data points for this person or to delete only the erroneous value.
Within a feature, variations can exist that need to be normalized before a machine can understand them as referring to the same thing. Addresses are a typical example of inconsistent data. Some examples of variations in addresses are:
Some contain "Street," others "St.," and some don't contain the word "street" at all.
In the U.S., an address might use a five-digit ZIP code or a ZIP+4 code.
Some addresses include the country; others don't.
Another consideration within this topic is formatting. For example, a feature that represents a date might be in the format "11/12/21". Which date this refers to depends on whether the person who wrote this data was in a place where the format is month/day/year or day/month/year. Once identified, these issues are typically normalized during the data analysis and visualization stage since normalization is necessary to understand the reality represented by the information.
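The ambiguity can be made concrete with Python's standard library: the same hypothetical string parses to two different dates depending on which convention is assumed.

```python
from datetime import datetime

raw = "11/12/21"

# Same string, two different dates, depending on the writer's convention.
us = datetime.strptime(raw, "%m/%d/%y")    # month/day/year -> November 12, 2021
intl = datetime.strptime(raw, "%d/%m/%y")  # day/month/year -> December 11, 2021
```

Without knowing where the data was recorded, neither interpretation can be ruled out, which is why the format has to be normalized before analysis.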
Sometimes, outliers can represent errors in measurement, inadequate data collection, or show variables not considered when collecting the data. Other times, outliers represent real cases in the world that are rare yet important for the algorithm's predictive capability. It is up to the analyst to decide what to do with them. Suppose the outliers are obviously incorrect, like a ten-year-old with credit card history. In that case, the analyst can choose to delete that row or impute a more reasonable value (more on this later).
Suppose the outlier represents a real-world phenomenon that the analyst deems necessary to include in the dataset. In that case, they may apply mathematical transformations that prevent the outlier from rendering the measures of central tendency (the mean, median, and mode) useless.
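One common transformation of this kind is a logarithm, which compresses large values so a genuine outlier no longer dominates the mean. A minimal sketch with hypothetical salary data:

```python
import math

# Hypothetical salaries with one extreme but real outlier.
salaries = [42_000, 48_000, 51_000, 55_000, 3_000_000]

mean = sum(salaries) / len(salaries)  # 639,200 -- dominated by the outlier

# A log transform compresses the scale, so the outlier no longer
# dwarfs the rest of the distribution.
log_salaries = [math.log1p(s) for s in salaries]
```

On the log scale, the largest value is less than twice the smallest, so the outlier stays in the dataset without rendering summary statistics useless.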
Datasets are rarely, if ever, free from missing and inaccurate values. Missing data is a common obstacle for an analyst trying to train a model that makes reliable and correct predictions. The fewer values a particular feature has, the less useful it becomes to the algorithm's predictive ability.
Typically, if over half of the data is missing for a particular feature, the feature is deleted: the analyst can't infer what those values are from the information they do have, and the feature won't contribute much to the prediction. Suppose only 25% of a feature's values are present. In that case, it is unlikely that the analyst will use that variable as a proxy for predicting an outcome.
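This rule of thumb can be sketched with pandas on a hypothetical dataset: compute the fraction of missing values per feature and drop any feature that is more than half empty.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: "salary" is missing in 4 of 6 rows.
df = pd.DataFrame({
    "age":    [34, 29, np.nan, 41, np.nan, 38],
    "salary": [np.nan, 52_000, np.nan, np.nan, np.nan, 60_000],
})

missing = df.isna().mean()  # fraction of missing values per feature

# Drop features where more than half of the values are missing.
df = df.drop(columns=missing[missing > 0.5].index)
```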
Deleting data is not without risk, however: having enough observations is critical to machine learning's success, and this approach might bias the dataset. For example, people with very high salaries may not report them, which manifests as missing values. Removing those observations from the dataset might deprive it of crucial information.
The analyst looks at the percentage of missing data for individual observations as well. With too little information in a single row, the quality of prediction power decreases.
Data imputation aims to replace missing data in features and observations so that they can be used to train the model. This process consists of replacing null values with an estimation that is determined by the analyst.
One of the risks is that the dataset can be changed too much, skewing it away from reality. The analyst will likely insert some values that turn out to be wrong, which can make the algorithm less accurate and less generalizable. It should be made clear that this step is not supposed to distort the dataset. It is meant to be a tool that brings the dataset closer to the real-world environment in which it will be making predictions.
In some cases, the analyst may decide to replace missing values with measures of central tendency: the mean (the average), the median (the middle value), or the mode (the most frequent value). The issue with this technique is that it only works at the column level and misses correlations between features.
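A minimal sketch of central-tendency imputation with pandas, on hypothetical data: the numerical column is filled with its mean, the categorical column with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "age":  [25.0, 30.0, np.nan, 35.0],
    "city": ["Berlin", None, "Berlin", "Paris"],
})

# Numerical feature: fill with the mean; categorical feature: with the mode.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```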
A more robust technique is called stratified replacement, which groups existing data points by a particular feature, like location. It assigns an estimate through imputation for missing values based on that grouping. This technique is useful for numerical features since it replaces missing values with the average of the group. For categorical features, the missing values are replaced with the mode.
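Stratified replacement can be sketched with a pandas group-by on hypothetical data: each missing salary is filled with the mean of its own location group rather than the global mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data grouped by location.
df = pd.DataFrame({
    "location": ["NY", "NY", "NY", "TX", "TX", "TX"],
    "salary":   [90_000, np.nan, 110_000, 60_000, 70_000, np.nan],
})

# Fill each missing salary with the mean of its own location group.
group_mean = df.groupby("location")["salary"].transform("mean")
df["salary"] = df["salary"].fillna(group_mean)
```

The missing NY salary becomes 100,000 and the missing TX salary 65,000, reflecting the difference between the two groups that a global mean would have erased.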
This technique requires a more advanced process of clustering the data points. Likely, another machine learning model is used to impute the missing data.
For example, the k-nearest neighbors algorithm finds the most similar rows and averages their values. This approach assumes numerical data, not categorical. The analyst can alternatively use unsupervised deep learning methods, which work well for categorical data, but how the algorithm creates the groupings is an opaque process. Regression is another technique that finds linear or non-linear relationships between the missing feature and other features.
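A sketch of k-nearest-neighbors imputation using scikit-learn's `KNNImputer` (assuming scikit-learn is available), on a small hypothetical numerical matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical observations; np.nan marks the missing entry.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [8.0, 9.0, 10.0],
])

# The missing value is replaced by the average of that feature in the
# 2 rows most similar on the features that are present.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```

The first row's missing entry becomes the average of its two nearest neighbors' values (3.0 and 3.2), not the column mean, which the distant fourth row would otherwise pull upward.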
Data imputation is a space where the dataset can become biased away from reality based on the analyst's assumptions. Part of your FOI request can include:
what percentage of values were missing on the feature level and observation level
what methods were used for data imputation, and what was the reasoning behind this decision
Data transformation is the process of changing the format, structure, or values of data to train a machine learning model. Data transformation may be:
constructive— adding, copying, and replicating data
destructive— deleting fields and records
aesthetic— standardizing salutations or street names
structural— renaming, moving, and combining columns in a database
Lack of expertise and carelessness can introduce problems during transformation. Data analysts without appropriate subject matter expertise are less likely to notice typos or incorrect data because they are less familiar with the range of accurate and permissible values. For example, someone working on medical data who is unfamiliar with relevant terms might fail to flag disease names that should be mapped to a singular value or notice misspellings.
Some considerations during the data transformation process:
When combining multiple files and data sources, the analyst needs to ensure that the units are the same. For example, it might be necessary to normalize salary data using an exchange rate. In this case, it's essential to use the exchange rate from when the data was collected — otherwise, this is a space for error.
Some machine learning models like random forests or neural networks don't work well with text-based inputs. It is necessary to change the data from categorical to numerical through a process called codification. It's imperative to note here that even though a numerical score replaces nominal data, it doesn't mean that the data type has changed. The numerical score is a reference point — taking an average of the values won't be meaningful because the data it represents is still nominal.
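Codification can be sketched with pandas category codes on hypothetical data; the resulting integers are reference points, not quantities.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Each category gets an integer code (assigned alphabetically here:
# blue=0, green=1, red=2).
df["color_code"] = df["color"].astype("category").cat.codes
```

Averaging `color_code` would produce a number, but that number means nothing: the underlying data is still nominal.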
Numerical features within a dataset are typically brought onto a standardized scale. Some machine learning algorithms will assume that attributes with larger numerical values have more predictive power than others. For example, salary tends to have numerical values that are significantly larger than the number of children someone has. The numerical data is brought onto a standardized scale while the data's integrity remains intact. Another reason for doing this is that some statistical methods, like principal component analysis, require standardized data.
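A common form of this is z-score standardization, sketched here with pandas on hypothetical data: subtract each feature's mean and divide by its standard deviation, so every feature ends up with mean 0 and standard deviation 1.

```python
import pandas as pd

# Hypothetical features on very different scales.
df = pd.DataFrame({
    "salary":   [40_000.0, 55_000.0, 70_000.0],
    "children": [0.0, 1.0, 2.0],
})

# z-score standardization: no feature dominates purely by scale, and
# the relative ordering within each feature is preserved.
standardized = (df - df.mean()) / df.std()
```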
When joining data from multiple sources, the data scientist must be mindful of what data is added and how it affects the data distribution. It is up to the data scientist to decide a fair and reasonable concatenation of a dataset.
Feature construction involves transforming a given set of input features to generate a new set of more robust features, which can be used for prediction.
By constructing new features, the analyst can:
isolate and highlight critical information, which helps the algorithms "focus" on what's important
bring in domain expertise by creating an indicator variable, which "indicates" if an observation meets a particular condition relevant to the domain
reduce the complexity of their dataset by combining features using products, sums, or differences
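Both kinds of constructed features can be sketched in a few lines of pandas on hypothetical financial data: an indicator variable encoding a domain condition, and a difference combining two features.

```python
import pandas as pd

# Hypothetical financial data.
df = pd.DataFrame({
    "income":   [30_000, 80_000, 120_000],
    "expenses": [25_000, 60_000, 130_000],
})

# Indicator variable: does this person spend more than they earn?
df["overspending"] = (df["expenses"] > df["income"]).astype(int)

# Combined feature: a single difference can stand in for two columns.
df["disposable"] = df["income"] - df["expenses"]
```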
Data reduction is really about dimensionality reduction. The more features you have, the more dimensions you have in your dataset. Some machine learning algorithms, like linear regression, get relatively slow when there are too many attributes.
In general, very large datasets take longer to train and test, affecting the number of hyperparameters that can be tested. An analyst can reduce the amount of time this takes by reducing the dataset, in other words, removing observations and features.
If a dataset is too big, the analyst may randomly sample the data to cut it by 50%. However, it is vital to maintain the original distribution of the dataset. Stratified sampling is a technique that preserves the original data distribution in the reduced dataset.
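Stratified sampling can be sketched with pandas' `GroupBy.sample`, which draws the same fraction within each group so class proportions are preserved (hypothetical data):

```python
import pandas as pd

# Hypothetical dataset with a 2:1 class imbalance.
df = pd.DataFrame({
    "label": ["a"] * 8 + ["b"] * 4,
    "value": range(12),
})

# Sample 50% within each class: 4 of 8 "a" rows and 2 of 4 "b" rows,
# so the 2:1 proportion survives the reduction.
half = df.groupby("label").sample(frac=0.5, random_state=0)
```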
Correlation analysis allows the analyst to identify pairs of variables with high correlation, meaning that when one variable is high, so is the other, and vice versa. It might not be useful to have both variables in the dataset. The decision is up to the data scientist.
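A sketch of correlation analysis with pandas, using hypothetical data where two columns carry essentially the same information:

```python
import pandas as pd

# Hypothetical data: height recorded in both centimeters and inches.
df = pd.DataFrame({
    "height_cm": [160, 170, 180, 190],
    "height_in": [63.0, 66.9, 70.9, 74.8],
    "weight_kg": [55, 80, 70, 95],
})

corr = df.corr()

# height_cm and height_in are almost perfectly correlated, so one of
# them is a candidate for removal.
```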
Forward or backward attribute selection is a process that requires the analyst to already have a machine learning model in mind. In backward attribute selection, the analyst trains the model using all the attributes, evaluates its performance, and removes an attribute to measure the effect on performance. If removing the attribute has no effect on the model's predictive performance, the attribute is deleted. Similarly, in forward attribute selection, the analyst begins by training the model with one attribute, then adds an attribute each round, evaluating its effect on the model's performance.
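This process is automated in scikit-learn's `SequentialFeatureSelector` (assuming scikit-learn is available). A sketch on synthetic data where only two of four features drive the target:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
# Only the first two features actually drive the target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Backward selection: start from all features and repeatedly drop the
# one whose removal hurts cross-validated performance the least.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward"
).fit(X, y)
```

With a strong signal like this, the selector keeps the two informative features and discards the two pure-noise ones.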
If they haven't already, the analyst can now partition their data and begin testing and evaluating different machine learning models based on their performance in predicting the output variable.
If you notice anything incorrect or missing from our explanations, please let us know through the contact form! We want our content to be as accurate and useful as possible — and you can help us do that.