Introduction to Data Cleaning

There is considerable flexibility in how the analyst shapes the data into its smallest, most useful form. Relative to the other steps in this process, the analyst will spend the most time cleaning and preparing the data.

We use machine learning to tease out knowledge from a dataset or to make predictions. Data doesn't necessarily mean anything until it's analyzed — there is a distinction between having data and having knowledge.

In other words, you can collect as much data as you'd like and tabulate it. However, you won't have any insight unless you have processes and tools to understand your information and what it means.

Most datasets are messy and noisy. Data cleaning is required before the analyst can apply machine learning techniques to the data. This process involves the following:

  • using statistical measures to figure out what's going on with the data

  • charting and plotting data to identify trends and how variables relate to each other

  • standardizing or normalizing the data

  • reducing the dataset to the most relevant, useful observations
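The steps above can be sketched in a few lines of Python. The sales figures below are invented for illustration; the cutoff of two standard deviations is one common rule of thumb, not a fixed standard.

```python
import statistics

# Hypothetical feature values: daily sales figures (illustrative only).
sales = [12.0, 15.0, 14.0, 13.0, 150.0, 16.0, 11.0]

# 1. Statistical measures to see what's going on with the data.
mean = statistics.mean(sales)
median = statistics.median(sales)
stdev = statistics.stdev(sales)

# A mean far above the median hints at a skewing outlier (150.0 here).
print(f"mean={mean:.1f}, median={median:.1f}, stdev={stdev:.1f}")

# 2. Reduce the dataset to the most relevant observations:
#    drop values more than 2 standard deviations from the mean.
cleaned = [x for x in sales if abs(x - mean) <= 2 * stdev]

# 3. Min-max normalization: rescale the remaining values to [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(x - lo) / (hi - lo) for x in cleaned]
print(normalized)
```

In practice the charting step would come between 1 and 2 (for example, a box plot makes the 150.0 value obvious at a glance), but the statistical summary alone already flags it here.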

The fundamental goal of data cleaning is to ensure the data represents the real world and is in a format that can be fed into a machine learning model.

What is data?

Data has to be measurable in a standard way. Suppose you're surveying people: your questions and answers need to follow a consistent standard for you to analyze them. You can't have some people giving a rating from 1 to 5 and others saying "good" or "bad" for the same question.
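A minimal sketch of fixing exactly that survey problem: map the free-text answers onto the same 1-to-5 scale as the numeric ones. The labels and the mapping are assumptions for illustration, not a standard.

```python
# Hypothetical raw survey answers: some respondents gave a 1-5 rating,
# others answered with words.
raw_answers = [5, "good", 3, "bad", 4, "good", 1]

# Assumed mapping from text labels onto the same 1-5 scale.
text_to_rating = {"bad": 1, "poor": 2, "okay": 3, "good": 4, "excellent": 5}

def standardize(answer):
    """Return an integer rating from 1 to 5 for any supported answer."""
    if isinstance(answer, int) and 1 <= answer <= 5:
        return answer
    return text_to_rating[answer]

ratings = [standardize(a) for a in raw_answers]
print(ratings)  # every answer now on one consistent scale
```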

Data in machine learning contexts can be categorized by three properties: whether there is distance between values, whether the values have an ordering, and whether the scale has an absolute zero. These properties indicate which statistical measures can be used to explore the data.

Distance refers to a quantitative value that demonstrates how far apart two values from the same feature are.

Ordering refers to an inherent hierarchy in the values of a feature.

This example can best explain an absolute zero: you can say you have zero children, but you cannot say there is no temperature outside because it is zero degrees on the thermometer.

The types of data

Nominal - has no distance and no ordering

Ordinal - has ordering, but no distance

Interval - has ordering and distance, but no absolute zero

Ratio - has ordering, distance, and an absolute zero on the scale

Features can also be categorized based on whether their values are numerical or categorical. Categorical or class variables are nominal values. Numerical values can be ordinal, interval, or ratio values.
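The four scales and the statistics they support can be summarized in a short sketch. The property table follows the definitions above; the choice of example statistics per property is a rough guide, not an exhaustive rule.

```python
# The four measurement scales and their properties.
# Each entry: (has_ordering, has_distance, has_absolute_zero).
scales = {
    "nominal":  (False, False, False),  # e.g. eye color
    "ordinal":  (True,  False, False),  # e.g. small/medium/large
    "interval": (True,  True,  False),  # e.g. temperature in Celsius
    "ratio":    (True,  True,  True),   # e.g. number of children
}

def valid_stats(scale):
    """Rough guide to which statistics make sense for a given scale."""
    ordering, distance, zero = scales[scale]
    stats = ["mode"]             # counting values is always allowed
    if ordering:
        stats.append("median")   # requires a rank order
    if distance:
        stats.append("mean")     # requires meaningful differences
    if zero:
        stats.append("ratio comparisons")  # "twice as much" needs a true zero
    return stats

print(valid_stats("ordinal"))   # mode and median, but no mean
```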

What is data cleaning?

Data cleaning is an iterative process done until the dataset is in its smallest, most useful form. It is composed of two stages — exploration and pre-processing.

Exploration mainly concerns itself with statistical analysis and visualization. Pre-processing refers to the process of transforming and reducing the dataset into its smallest, most useful form before it is fed to a model.

What characteristics of the dataset need to be "cleaned"?

  • Outliers: values representing a small percentage of the actual data in the real world, which skew the dataset away from reality

  • Missing data: observations that are missing some features

  • Malicious data: someone trying to intentionally skew the results of the algorithm by creating data that is not representative of reality, but representative of their interests

  • Erroneous data: the way that the data was collected results in errors in the data

  • Irrelevant data: observations that are not necessary for the algorithm's goal

  • Inconsistent data and formatting: observations that refer to the same meaning but have different representations of the data
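Two of these problems can be spotted and fixed in a few lines: inconsistent representations of the same value, and observations with missing features. The records, field names, and canonical spellings below are invented for illustration.

```python
# Toy dataset with one inconsistently formatted field and one missing value.
records = [
    {"country": "USA", "age": 34},
    {"country": "U.S.A.", "age": None},   # inconsistent spelling + missing age
    {"country": "United States", "age": 29},
    {"country": "Canada", "age": 41},
]

# Map every known spelling to one canonical representation.
canonical = {"USA": "USA", "U.S.A.": "USA", "United States": "USA",
             "Canada": "Canada"}

cleaned = [{**r, "country": canonical[r["country"]]} for r in records]

# Flag observations that are missing a feature for later handling.
missing = [r for r in cleaned if r["age"] is None]
print(f"{len(missing)} record(s) missing an age")
```

Outlier and malicious-data detection are harder because they require a notion of what "normal" looks like, which is exactly what the exploration stage is for.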

How often is there missing data? Is this common?

Extremely common. For many data scientists, analysts, or engineers, dealing with missing or incorrect data is an everyday experience. There is a multitude of reasons why:

  • Combining data from multiple sources that didn't use the same measurements

  • Measurement changed during the process of data collection

  • Measurement paused for a period of time, leaving a gap in the data

  • Human errors during data entry

  • Incorrect sensor readings

  • Software bugs in the data processing pipeline

What can be done about missing data?

It depends on how much of the data is missing and what other data exists in the dataset. The process of how to fill in the missing data is entirely up to the data scientist. The data analysis and visualization stages aim to inform these decisions about transforming the data into a form that a machine learning model is trained on.
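One common choice, sketched below, is to fill missing numeric values with the mean of the observed values. Whether mean imputation is appropriate depends on the dataset and the feature; as noted above, that decision belongs to the data scientist. The ages are invented for illustration.

```python
import statistics

# Hypothetical feature with gaps, represented as None.
ages = [34, None, 29, 41, None, 38]

# Compute the mean over the observed values only.
observed = [a for a in ages if a is not None]
fill_value = statistics.mean(observed)

# Replace each gap with the mean of the observed values.
imputed = [a if a is not None else fill_value for a in ages]
print(imputed)
```

Other options include filling with the median (more robust to outliers) or dropping the incomplete observations entirely when few values are missing.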
