There is considerable flexibility in how the analyst shapes the data into its smallest, most useful form. Relative to the other steps in this process, the analyst will spend the most time cleaning and preparing the data.
We use machine learning to tease out knowledge from a dataset or to make predictions. Data doesn't necessarily mean anything until it's analyzed — there is a distinction between having data and having knowledge.
In other words, you can collect as much data as you'd like and create tables with them. However, you won't have any insight unless you have processes and tools to understand your information and what it means.
Most datasets are messy and noisy. Data cleaning is required before the analyst can apply machine learning techniques to the data. This process involves the following:
using statistical measures to figure out what's going on with the data
charting and plotting data to identify trends and how variables relate to each other
standardizing or normalizing the data
reducing the dataset to the most relevant, useful observations
The fundamental goal of data cleaning is to ensure the data represents the real world and is in a format that can be fed into a machine learning model.
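The steps above can be sketched in a few lines of pandas. This is a minimal illustration, not a full pipeline; the column names and values are invented for the example.

```python
import pandas as pd

# Invented dataset: "age" contains an implausible value (120) for illustration
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 120],
    "income": [38000, 52000, 61000, 45000, 47000],
})

# 1. Statistical measures to figure out what's going on with the data
print(df.describe())

# 2. Standardize a numeric feature (z-score: subtract mean, divide by std)
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# 3. Reduce the dataset to the most plausible observations
df = df[df["age"] <= 100]
```

Each of these steps is explored in more detail in the sections that follow.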
What is data?
Data has to be measurable in a standard way. Suppose you're trying to survey people. In that case, your questions and answers need to be consistent with some standard to analyze them. You can't have some people giving a rating from 1 to 5 and others saying "good" or "bad" for the same question.
Data in machine learning contexts can be categorized by whether there is distance between values, whether the values have an inherent ordering, and whether the scale has an absolute zero. These properties indicate which statistical measures can be used to explore the data.
Distance refers to a quantitative value that demonstrates how far apart two values from the same feature are.
Ordering refers to an inherent hierarchy in the values of a feature.
An example best explains absolute zero: you can say you have zero children (a true absence), but zero degrees on a thermometer does not mean there is no temperature; the zero point of that scale is arbitrary.
The types of data
Nominal - has no distance and no ordering
Ordinal - has ordering, but no distance
Interval - has ordering and distance, but no absolute zero
Ratio - has ordering, distance, and an absolute zero on the scale
Features can be categorized based on whether their values are numerical or categorical. Categorical or class variables are nominal values. Numerical values can be ordinal, interval, or ratio values.
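Ordinal data can be represented directly in pandas with an ordered categorical. This sketch uses invented survey ratings; the point is that ordering is meaningful but distance is not.

```python
import pandas as pd

# Invented survey responses: ordinal data has ordering but no defined distance
ratings = pd.Categorical(
    ["good", "bad", "excellent", "good"],
    categories=["bad", "good", "excellent"],  # explicit ordering of the levels
    ordered=True,
)
s = pd.Series(ratings)

# Ordering-based statistics are valid for ordinal data
print(s.min(), s.max())
# A mean would not be meaningful: there is no distance between "bad" and "good"
```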
What is data cleaning?
Data cleaning is an iterative process done until the dataset is in its smallest, most useful form. It is composed of two stages — exploration and pre-processing.
Exploration mainly concerns itself with statistical analysis and visualization. Pre-processing (sometimes called "data wrangling") refers to the process of transforming and reducing the dataset into its smallest, most useful form.
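A typical exploration step is computing summary statistics and checking how variables relate to each other. The toy data below is invented for illustration.

```python
import pandas as pd

# Invented measurements for the exploration stage
df = pd.DataFrame({
    "height_cm": [160, 172, 181, 168, 175],
    "weight_kg": [55, 70, 85, 63, 74],
})

# Summary statistics: count, mean, std, min/max, and quartiles per feature
summary = df.describe()
print(summary)

# Pairwise Pearson correlation shows how the variables relate to each other
corr = df.corr()
print(corr.loc["height_cm", "weight_kg"])
```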
What characteristics of the dataset need to be "cleaned"?
Outliers: extreme values that make up a small fraction of the observations but can skew the dataset away from what it represents in the real world
Missing data: observations that are missing some features
Malicious data: someone trying to intentionally skew the results of the algorithm by creating data that is not representative of reality, but representative of their interests
Erroneous data: the way that the data was collected results in errors in the data
Irrelevant data: observations that are not necessary for the algorithm's goal
Inconsistent data and formatting: observations that refer to the same meaning but have different representations of the data
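One common way to flag outliers is the interquartile-range (IQR) rule: values more than 1.5 times the IQR beyond the quartiles are treated as suspect. The series below is invented, with one obvious outlier.

```python
import pandas as pd

# Invented sensor readings with one obvious outlier (1000)
s = pd.Series([12, 15, 14, 13, 16, 1000, 14, 15])

# IQR rule: flag values beyond 1.5 * IQR from the first and third quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
print(s[mask].tolist())
```

Whether a flagged value is dropped, capped, or kept is a judgment call that depends on what the data represents.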
How often is there missing data? Is this common?
Extremely common. For many data scientists, analysts, or engineers, dealing with missing or incorrect data is an everyday experience. There is a multitude of reasons why:
Combining data from multiple sources that didn't use the same measurements
Measurement changed during the process of data collection
Measurement paused for a period of time, leaving a gap in the data
Human errors during data entry
Incorrect sensor readings
Software bugs in the data processing pipeline
What can be done about missing data?
It depends on how much of the data is missing and what other data exists in the dataset. How to fill in missing data is ultimately a judgment call by the data scientist, and the analysis and visualization stages exist to inform those decisions about transforming the data into the form a machine learning model is trained on.
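Two common options are dropping incomplete rows or filling gaps with a summary statistic. The dataset below is invented; which option is appropriate depends on how much data is missing.

```python
import numpy as np
import pandas as pd

# Invented dataset with gaps (NaN marks a missing value)
df = pd.DataFrame({
    "temp": [21.0, np.nan, 23.5, 22.0, np.nan],
    "humidity": [40, 42, np.nan, 45, 44],
})

# Option 1: drop rows missing any feature (fine if few rows are affected)
dropped = df.dropna()

# Option 2: impute each gap with the column median (robust to outliers)
imputed = df.fillna(df.median())

print(len(dropped), imputed.isna().sum().sum())
```

Dropping discards information from the partially observed rows, while imputation keeps them at the cost of inventing values, so the choice should be informed by the exploration stage.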
If you notice anything incorrect or missing from our explanations, please let us know through the contact form! We want our content to be as accurate and useful as possible — and you can help us do that.