Our goal is not to teach you how to build the technology but to instill in you the mindset of a machine learning engineer.
First, a disclaimer: all metaphors and analogies have their limitations. By its very nature, a metaphor does not convey a concept in its full form. However, metaphors are extremely useful in bridging knowledge between disparate fields. The metaphors used in these articles will not be perfect (no metaphor is), but they will give you a baseline understanding of the landscape, process, and life cycle of developing a machine learning algorithm.
What is machine learning, and what is it not? What are you going to learn here?
Machine learning is used to predict, estimate, or classify something. Machine learning itself is an automated process of discovering relationships or patterns between variables in a dataset. A machine-learning model's performance is assessed by how well it generalizes to data it has not "seen" before.
Machine learning is not synonymous with artificial intelligence. Machine learning is a subset of artificial intelligence focused on training computers to use algorithms for making predictions or classifications based on observed data.
An example of AI that isn't machine learning is a robot in a maze, where the robot is programmed to turn left or right at random. It does not learn anything and doesn't understand what the maze is. Despite this, it will eventually get to the end.
Alternatively, machine learning is the process of giving the robot examples and hoping it will learn when to turn left or right. The robot isn't given specific instructions like "If you're here, turn left; if you're here, turn right."
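To make the contrast concrete, here is a toy sketch of the non-learning robot (our own illustration, not from any real system): it wanders a one-dimensional corridor at random and, given enough steps, still reaches the exit without learning anything.

```python
import random

def random_walk_to_exit(corridor_length=5, seed=42):
    # The robot starts at position 0 and must reach `corridor_length`.
    # At each step it moves left or right at random -- no learning involved.
    rng = random.Random(seed)
    position, steps = 0, 0
    while position < corridor_length:
        position += rng.choice([-1, 1])
        position = max(position, 0)  # a wall behind the starting point
        steps += 1
    return steps

steps = random_walk_to_exit()  # the robot gets there, just inefficiently
```

The robot "succeeds" eventually, but nothing about its behavior improves with experience, which is exactly what separates it from a machine learning system.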
Supervised, Unsupervised, and Reinforcement Learning
Most algorithms that you'll likely encounter in your work are supervised machine learning algorithms. For that reason, our focus at Atlas Lab is on supervised machine learning techniques.
To provide clarity on the landscape, though, here's a quick breakdown of the three major machine learning categories:
Supervised machine learning is where an analyst provides data that includes the "answers" and asks the machine to make a prediction or classification based on similar data. In this approach, the ground truth is specified for the machine learning algorithm.
Unsupervised machine learning is where data is provided without "answers" — the machine attempts to discover them. These algorithms find hidden patterns or data groupings without the need for human intervention. This approach is often used to group data based on similarities or differences, find relationships between variables, and reduce the number of data inputs while also preserving the dataset's integrity.
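The unsupervised idea can be shown in miniature. The sketch below (a deliberately tiny, hand-rolled routine with made-up numbers, not a production clustering algorithm) receives numbers with no "answers" and groups them around two discovered centers:

```python
# Unsupervised learning in miniature: group unlabeled numbers into two
# clusters by their distance to two evolving "centers".
def two_means(values, iterations=10):
    centers = [min(values), max(values)]          # crude starting guesses
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the average of its group
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = two_means([1.0, 1.2, 0.8, 9.9, 10.1, 10.0])
```

No human labeled which numbers belong together; the grouping emerges from the data's own structure.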
Reinforcement learning is where there are no classes or characteristics; there's just an end point: pass or fail. To understand this better, consider the example of learning to play chess. After every game, the system is informed of the win/loss status. The system does not have every move labeled as "right" or "wrong"; it only has the end result. As the algorithm plays more games during training, it keeps giving bigger "weights" (importance) to the combinations of moves that resulted in a win.
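The chess example can be sketched in a few lines. This is a vast simplification of real reinforcement learning (the move names and outcomes below are invented), but it shows the core idea: moves that appear in winning games accumulate larger weights.

```python
# Toy sketch of the reinforcement idea: after each "game", every move that
# appeared in it is credited with the game's outcome.
weights = {"open_center": 0.0, "edge_push": 0.0}

games = [
    ({"open_center"}, "win"),
    ({"edge_push"}, "loss"),
    ({"open_center", "edge_push"}, "win"),
]

for moves, outcome in games:
    reward = 1.0 if outcome == "win" else -1.0
    for move in moves:
        weights[move] += reward   # credit (or blame) the whole game's moves
```

No individual move was ever labeled right or wrong; only the end result flowed back into the weights.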
Often, people talk about machine learning as having two paradigms: supervised and unsupervised learning. However, it is more accurate to describe machine learning models as falling along a spectrum of supervision between supervised and unsupervised learning.
For this platform, it is impractical to discuss, in-depth, the many ways in which every machine-learning algorithm varies from every other — and we don't claim to do so. We also don't pretend that all machine-learning projects fit precisely within our breakdown. And as we will continue to mention, most of the development of machine-learning models dances back and forth across the steps rather than progressing through them linearly.
The analyst makes critical decisions that drastically affect the way an algorithm will categorize or predict an outcome. Our goal is to help you understand what decisions are made during this process. We hope that with this overview, you'll be able to ask detailed questions, do adequate research, and strengthen your case.
Far too often, those who build these systems are all of a homogeneous background. Without proper consideration and management to mitigate the effects of racism, colonialism, and anti-Blackness, machine learning systems will effectively automate oppressive systems (with the added guise of the machine learning model's perceived "higher intelligence").
How is a machine learning model trained?
What steps lead to a machine learning model influencing decisions in the real world?
To fully understand the process of training a machine learning model, we highly recommend that you read the articles in order. We've done our best to keep them relevant, concise, and clear to introduce you to the topic and to get you comfortable using the terminology of machine learning analysts.
Simplifying this mapping, you can think of machine learning as two distinct workflows: exploratory data analysis, which comprises the first seven steps of our breakdown, and "the running model," which describes a machine-learning algorithm deployed and making decisions in the real world. Exploratory data analysis is structured like a scientific process rather than a linear one:
the problem is defined,
the analyst must make a hypothesis,
data is used to test the hypothesis,
the results are analyzed,
a conclusion is reached,
the hypothesis is refined,
and the process repeats.
Clarification on some terminology before we get started
We refer to machine-learning "algorithms" and "models" interchangeably throughout our content. Before a machine-learning model is deployed, "algorithm" and "model" refer to a set of mathematical steps for learning based on the dataset. This is known as "training" the algorithm or model. All algorithms are trained by optimizing an objective function or a loss function (more on this later).
After training, "algorithm" and "model" take on a slightly different meaning. They refer to the set of useful correlations, or "rules," discovered during training. These rules are what the model uses to make predictions, estimations, or decisions in the real world.
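To make the training/model distinction concrete, here is a toy sketch (our own, with made-up numbers): "training" searches for a rule that minimizes a crude loss function, and the trained "model" is simply that discovered rule applied to new inputs.

```python
# "Training": discover a rule (here, a single threshold) from labeled data.
def train_threshold(xs, labels):
    # Try every candidate cut; keep the one with the fewest training errors
    # (a crude loss function: the count of misclassified points).
    best_cut, best_errors = None, float("inf")
    for cut in xs:
        errors = sum((x >= cut) != y for x, y in zip(xs, labels))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

cut = train_threshold([1, 2, 3, 10, 11, 12],
                      [False, False, False, True, True, True])

# The deployed "model": nothing more than the learned rule.
def predict(x):
    return x >= cut
```

Before training, the interesting object is the search procedure; after training, it is the rule (`cut`) the search produced.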
What we mean when we say "the dataset"
We'll use an example: the Titanic dataset. You can think of it like an Excel spreadsheet. There are rows, columns, and cells.
A column describes data of a single type. One column can be referred to as a feature, attribute, or variable. All data from a single column will have the same scale and have meaning relative to each other.
A row describes a single entity. A row can be referred to as an observation or an instance. The values in the columns represent properties about that observation. The more rows you have, the more examples you have. A cell is one value in a row and column. It can be a number, text, or some representation of a category.
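In code, you can picture that spreadsheet as a list of rows. The sketch below uses a few illustrative Titanic-style passengers (the exact values are for demonstration only):

```python
# Each dict is a row (one observation: a passenger); each key is a column
# (a feature); each value is a cell.
rows = [
    {"name": "Braund, Mr. Owen",      "age": 22, "fare": 7.25,  "survived": 0},
    {"name": "Cumings, Mrs. John",    "age": 38, "fare": 71.28, "survived": 1},
    {"name": "Heikkinen, Miss Laina", "age": 26, "fare": 7.92,  "survived": 1},
]

ages = [row["age"] for row in rows]   # one column: same type, same scale
first_passenger = rows[0]             # one row: a single observation
```

Pulling out `ages` shows why a column is comparable within itself: every value is an age in years, so the numbers have meaning relative to each other.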
Guide to Machine Learning
And with that, you're free to get started learning about the supervised machine learning process! Again, we highly recommend that you read these articles in order.
Problem Definition and Data Collection
Machine-learning algorithms predict or estimate something. The first step of any analysis is defining what that something is and how to measure it.
Introduction to Data Cleaning
There is a lot of flexibility in how the analyst shapes the data into its smallest, most useful form. Relative to the other steps in this process, the analyst will spend the most time cleaning and preparing the data. Most datasets are messy and noisy, so data cleaning is required before the analyst can apply machine learning techniques. It is an iterative process, repeated until the dataset reaches its smallest, most useful form, and it is composed of two stages: exploration and pre-processing.
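A minimal cleaning pass might look like the sketch below (the rows and the choice to fill missing ages with the median are illustrative; a real analyst weighs many alternatives):

```python
import statistics

raw = [
    {"age": 22,   "fare": 7.25},
    {"age": None, "fare": 13.0},   # missing value
    {"age": 22,   "fare": 7.25},   # exact duplicate of the first row
    {"age": 40,   "fare": 30.0},
]

# Stage 1: drop exact duplicate rows.
seen, deduped = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Stage 2: fill missing ages with the median of the known ages.
median_age = statistics.median(r["age"] for r in deduped if r["age"] is not None)
cleaned = [{**r, "age": r["age"] if r["age"] is not None else median_age}
           for r in deduped]
```

Even this tiny example involves judgment calls: why the median and not the mean, and is a duplicate row an error or a legitimate repeat observation?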
Data Analysis and Visualization
Data analysis and data visualization are standard methods used to explore the data and generate hypotheses. Data visualization is the act of charting and plotting data points to identify trends and how variables relate to each other. The aim is to see the shape and range of the data and to identify outliers.
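Visualization at its simplest can even be done in text. The sketch below (made-up fares, and a crude text histogram where a real analyst would use a plotting library) bins a column to reveal its shape and a lone outlier:

```python
from collections import Counter

fares = [7, 8, 7, 9, 8, 7, 72]            # one suspiciously large value
bins = Counter(fare // 10 * 10 for fare in fares)  # group into bins of 10
chart = {b: "#" * n for b, n in sorted(bins.items())}
# Most fares pile up in the 0-9 bin; the outlier sits alone in the 70s bin.
```

Seeing that lone bar prompts the questions visualization exists to raise: is 72 a data-entry error, or a genuinely expensive ticket?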
Data Transformation and Reduction
Data transformation is the process of changing the format, structure, or values of data to train a machine learning model. The analyst has to make many judgment calls based on domain knowledge during this stage.
The goal of building a machine learning algorithm is to predict accurately in the real world. Therefore, data scientists have created a process that approximates how the algorithm will perform in real-world circumstances. This process is called data partitioning: the model is tested on data that it has not "seen" before.
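Partitioning can be sketched in a few lines (a hand-rolled version for illustration; in practice libraries provide this):

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    # Shuffle a copy of the rows, then hold some out for testing so the
    # model is scored on data it has not "seen" during training.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train_rows, test_rows = train_test_split(list(range(8)))
```

Shuffling first matters: if the rows arrive sorted (say, by date), slicing without shuffling would test the model on a systematically different slice of the world than it was trained on.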
Model Candidate Selection and Tuning
Machine learning models are chosen as candidates and tested against one another. Tuning is often a dance between steps: a hyperparameter is changed, the algorithm is rerun on the data, and the analyst evaluates the model's performance on the validation set. This process identifies which set of hyperparameters results in the most accurate model.
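That loop can be sketched with a tiny one-dimensional nearest-neighbour classifier (all numbers invented for illustration): `k`, the number of neighbours that vote, is the hyperparameter, and each candidate value is scored on a held-out validation set.

```python
def knn_predict(k, train, x):
    # Classify x by majority vote among its k nearest training points.
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return votes * 2 > k

def score(k, train, validation):
    hits = sum(knn_predict(k, train, x) == bool(y) for x, y in validation)
    return hits / len(validation)

train = [(1, 0), (2, 0), (3, 0), (2.1, 1),    # (2.1, 1) is a noisy point
         (10, 1), (11, 1), (12, 1)]
validation = [(2.2, 0), (10.5, 1)]

# The tuning loop: try each candidate k, keep the best validation score.
best_k = max([1, 3, 5], key=lambda k: score(k, train, validation))
```

With `k = 1` the noisy point misleads the model on the validation set; a larger `k` lets the majority outvote the noise, which is exactly the kind of trade-off tuning surfaces.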
Model Deployment
Model deployment is the final step in the machine learning process before a model can make predictions in the real world. These predictions will carry real consequences when forming the basis of decisions.
If you notice anything incorrect or missing from our explanations, please let us know through the contact form! We want our content to be as accurate and useful as possible — and you can help us do that. If you've worked on a case that would be a good fit for our platform, please submit it through our contact form and feel free to create a post in the forum.