Problem Definition and Data Collection

Machine-learning algorithms predict or estimate something. The first step of any analysis is defining what that something should be and how to measure it.

First, the abstract goal has to be translated into a decision about what, conceptually, to predict.

A court may, for example, want to determine the bail amount for individuals being processed. Accordingly, they may decide that the prediction that best suits this goal is to predict who will commit another crime. Suppose a person is predicted to be likely to do so. In that case, they will receive a higher bail amount or not receive bail at all — in theory, reducing the amount of recidivist crime.

Likewise, someone developing an autonomous vehicle may intend to create a car that minimizes human casualties. Part of reaching this goal may be developing an algorithm within the vehicle that enables it to predict whether objects in its surroundings are pedestrians, animals, or trees.

In both examples, a decision-maker has gone from an abstract goal to a predictive goal. But at this point, there is still no measurable quantity that stands in for the predictive goal.

In the next step, the decision-maker must translate the predictive goal into a specified outcome variable.

How easily the developer can accomplish this depends on how distant the predictive goal is from a fully specified outcome variable.

Take the autonomous vehicle example. There, the predictive goal (determining whether an object is a pedestrian, animal, or tree) entirely dictates the form of the outcome variable: it will be a categorical variable with three classes. A human can readily code each of the classes to create training data.
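As a rough illustration of what "a categorical variable with three classes" can look like in practice, the sketch below maps human-written labels to integer class codes. The names `ObjectClass` and `encode_labels`, the integer codes, and the sample labels are all assumptions made for this example; they are not taken from any real autonomous-vehicle system.

```python
from enum import Enum


class ObjectClass(Enum):
    """Hypothetical three-class outcome variable for the vehicle example."""
    PEDESTRIAN = 0
    ANIMAL = 1
    TREE = 2


def encode_labels(raw_labels):
    """Map human-written label strings to integer class codes."""
    return [ObjectClass[label.strip().upper()].value for label in raw_labels]


# Invented labels, as a human annotator might write them.
human_labels = ["pedestrian", "tree", "Animal", "pedestrian"]
encoded = encode_labels(human_labels)
print(encoded)  # [0, 2, 1, 0]
```

The integer codes carry no order or magnitude; they only distinguish the classes. A label outside the three classes raises a `KeyError`, which is one simple way to catch annotation mistakes early.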

In the court example, it is less obvious what outcome variable to use. It falls entirely to the people building the system to identify a proxy for who will commit another crime. They may use previous offenses on a person's record, but that record may itself have resulted from racist and discriminatory policing practices.
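To make the stakes of the proxy choice concrete, here is a minimal sketch in which two candidate proxies label the same people differently. The fields `rearrested` and `reconvicted` are hypothetical proxies chosen for illustration, and the four records are invented toy data, not real case data.

```python
# Invented toy records; both fields are hypothetical proxies for
# "committed another crime", which is not itself directly recorded.
records = [
    {"id": "A", "rearrested": True, "reconvicted": True},
    {"id": "B", "rearrested": True, "reconvicted": False},
    {"id": "C", "rearrested": False, "reconvicted": False},
    {"id": "D", "rearrested": True, "reconvicted": False},
]


def label_by(rows, field):
    """Build a binary outcome variable (1 or 0) from one proxy field."""
    return {row["id"]: int(row[field]) for row in rows}


def positive_rate(labels):
    """Share of people labeled 1 under a given proxy."""
    return sum(labels.values()) / len(labels)


rearrest_labels = label_by(records, "rearrested")
reconviction_labels = label_by(records, "reconvicted")

# People whose label flips depending on which proxy is chosen.
disagreements = [
    person for person in rearrest_labels
    if rearrest_labels[person] != reconviction_labels[person]
]

print(disagreements)                       # ['B', 'D']
print(positive_rate(rearrest_labels))      # 0.75
print(positive_rate(reconviction_labels))  # 0.25
```

In this toy data, half of the people receive a different label depending on the proxy, so a model trained on one outcome variable would be learning something different from a model trained on the other.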

Is this the problem definition process for all kinds of machine learning?

No. This applies specifically to supervised learning, where problem definition means defining the outcome variable. Unsupervised learning algorithms do not predict outcome variables labeled with ground truth. Instead, they group or cluster subjects together based, roughly speaking, on how similar their input data values are. For an unsupervised algorithm, then, problem definition entails deciding on a particular mathematical measure of similarity.
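The choice of similarity measure can change which subjects count as "close". The sketch below compares two common measures, Euclidean distance and cosine distance, on invented two-dimensional points. The point values and the helper `nearest` are assumptions made for illustration.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query, candidates, distance):
    """Name of the candidate closest to the query under a given measure."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))


# Invented points: "a" points in the same direction as the query but is
# far away; "b" is nearby but points in a different direction.
query = (1.0, 1.0)
candidates = {"a": (10.0, 10.0), "b": (2.0, 0.0)}

print(nearest(query, candidates, euclidean_distance))  # b
print(nearest(query, candidates, cosine_distance))     # a
```

Under Euclidean distance the query would be grouped with `b`; under cosine distance it would be grouped with `a`. Neither answer is wrong, which is why selecting the measure is part of defining the problem.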

At Atlas Lab, our focus is not currently on unsupervised algorithms. Our reasoning is that supervised algorithms drive many legally consequential decisions, including risk assessment, predictive policing, credit scoring, employment processes, and more. However, some of the content we discuss in our articles may be just as applicable to techniques that fall on the spectrum between supervised and unsupervised learning.

How do data scientists decide on an outcome variable?

Subject-matter knowledge

People steeped in a particular context may have institutional knowledge that gives them reason to believe one outcome variable is more useful than others.

Technical considerations

Different types of machine-learning algorithms work with different forms of outcome variable, such as a numeric value or a set of categories, and produce different kinds of supplementary output.
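As a small sketch of what different outcome variable forms can mean, the code below takes one raw measurement and codes it three ways. The measurement (missed payments per year, loosely tied to the credit scoring setting), the band cut-offs, and all values are invented for illustration.

```python
# Invented raw measurement: missed payments in a year for five people.
missed_payments = [0, 1, 4, 0, 2]

# Form 1: a numeric count, the kind of outcome a regression model predicts.
as_count = list(missed_payments)

# Form 2: a binary outcome (any missed payment or none), suited to a
# two-class classifier.
as_binary = [int(count > 0) for count in missed_payments]


def to_band(count):
    """Hypothetical cut-offs turning a count into an ordered category."""
    if count == 0:
        return "none"
    if count <= 2:
        return "low"
    return "high"


# Form 3: ordered categories, suited to a multi-class classifier.
as_bands = [to_band(count) for count in missed_payments]

print(as_binary)  # [0, 1, 1, 0, 1]
print(as_bands)   # ['none', 'low', 'high', 'none', 'low']
```

Each coding discards or keeps different information: the binary form treats one missed payment the same as four, while the banded form depends on where the cut-offs are drawn.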

Other concerns can lead analysts to choose a particular machine-learning model or a specific algorithmic output, and they will then choose an outcome variable that fits those constraints.

Some outcome variable specifications are simply easier or less expensive to measure. Of course, choosing an outcome variable for the sake of convenience carries a greater risk of mismatch between the predictive goal and the variable's specification.

Most machine-learning algorithms are trained by optimizing an objective function. An objective function is a mathematical expression of the algorithm's goal, often some measure of how correct or incorrect its predictions are. A loss function is an objective function that training tries to minimize: it quantifies the difference between the actual outcome and the outcome the model predicts.
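For readers who want to see what a loss function computes, here is a minimal sketch of three widely used ones: zero-one loss, mean squared error, and log loss. The function names and sample values are chosen for this example, and real libraries add safeguards that this sketch omits (for instance, `log_loss` here assumes every predicted probability is strictly between 0 and 1).

```python
import math


def zero_one_loss(y_true, y_pred):
    """Fraction of predictions that differ from the actual class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared gap between actual and predicted numeric values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred):
    """Average negative log-likelihood for binary outcomes (0 or 1).

    p_pred holds the predicted probability that the outcome is 1.
    """
    total = sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, p_pred)
    )
    return -total / len(y_true)


# Invented values for illustration.
print(zero_one_loss([1, 0, 1, 1], [1, 0, 0, 1]))    # 0.25
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))   # 2.5
print(log_loss([1, 0], [0.5, 0.5]))                 # about 0.693
```

All three return 0 (or approach it) when predictions match the actual outcomes and grow as predictions get worse, but they penalize errors differently, so the choice of loss shapes what the trained model treats as a costly mistake.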

Data Collection

Since a vast amount of legal scholarship exists on data collection, we won't cover this step of the machine-learning process.* Our goal is to fill the knowledge gap between tech and law, particularly in areas where technical information is difficult for a legal professional to come across. For this reason, we'll focus on the steps that come after the data is collected, starting with data cleaning.

* For more information on this topic, see Jonathan's article on legal considerations of data collection.

If you notice anything incorrect or missing from our explanations, please let us know through the contact form! We want our content to be as accurate and useful as possible — and you can help us do that.

© 2020 Atlas Lab