Data analysis and data visualization are standard methods used to explore the data and generate hypotheses.
Data analysis is the process of using statistical measures to summarize a dataset and understand what it actually contains.
Why is this important?
Reviewing summary statistics of every variable in the dataset is one way to ensure the dataset is clean and representative of the real world. It informs how the analyst will partition the data in the next stage.
Why wouldn't a dataset be representative of the real world?
Data regularly has some degree of error or random noise within it.
Outliers skew the dataset away from reality because they are extreme values that represent only a small percentage of cases in the real world. Some outliers are clearly invalid, like a recorded age of 400 years, and are usually replaced or removed. Others aren't so obvious, and the analyst has to decide what to do with them. More on this later.
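One common rule of thumb for flagging outliers is the interquartile range (IQR) test. Here is a minimal sketch in pandas, using hypothetical ages that include the invalid value mentioned above; the 1.5 × IQR threshold is a convention, not a law:

```python
import pandas as pd

# Hypothetical ages, including one clearly invalid outlier (400).
ages = pd.Series([23, 31, 27, 45, 38, 29, 400])

# Rule of thumb: flag values more than 1.5 IQRs beyond the quartiles.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
cleaned = ages[(ages >= lower) & (ages <= upper)]
```

For obviously invalid values like an age of 400, removal is uncontroversial; for borderline values, the analyst still has to make a judgment call.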
Cardinality refers to the number of possible values that a feature can have. For example, the feature city has high cardinality, meaning it can have many labels or categories like London, New York, Manchester, and so on. The feature gender, by contrast, typically has a cardinality of 2 in datasets - male and female (a consequence of colonialism and the erasure of genders that exist outside of the gender binary).
For features with high cardinality, not all possible values may be represented in the dataset that the analyst is working with. Depending on how widespread this issue is, the analyst may choose to combine values into groups through a process called binning, lowering the complexity of the data by lowering its cardinality.
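One simple form of binning for a high-cardinality categorical feature is to keep the frequent values and group everything else into an "Other" bucket. A sketch in pandas, with hypothetical city values and an arbitrary frequency threshold:

```python
import pandas as pd

# Hypothetical city column with more labels than we want to keep.
cities = pd.Series(["London", "London", "New York", "Manchester",
                    "Leeds", "New York", "London", "Bristol"])

# Keep values that appear at least twice; bin the rest as "Other".
# The threshold of 2 is an arbitrary choice for illustration.
counts = cities.value_counts()
frequent = counts[counts >= 2].index
binned = cities.where(cities.isin(frequent), "Other")
```

The cardinality drops from five labels to three, at the cost of losing the distinction between the rare cities.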
Overfitting and lack of generalizability
Machine learning models are particularly adept at finding correlations in data. With outliers and high cardinality, a model will be prone to overfitting, which leads to substantial errors and significantly reduces its predictive power.
The goal for any supervised machine learning model is generalizability, the ability to digest new data and make accurate predictions.
Overfitting is when a model conforms too closely to the noise instead of learning predictively useful rules about the dataset. Despite making accurate predictions for the training data, it will make inaccurate predictions when given new data.
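A toy illustration of this with NumPy: fit the same noisy linear data with a flexible degree-9 polynomial and a simple degree-1 line. The flexible model chases the noise, so it scores better on the training data but worse on new data from the same underlying process. The data here is synthetic and the degrees are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, 10)  # linear signal plus noise

# A degree-9 polynomial has enough flexibility to chase the noise;
# a degree-1 fit can only learn the underlying linear trend.
overfit_coeffs = np.polyfit(x_train, y_train, 9)
simple_coeffs = np.polyfit(x_train, y_train, 1)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# The overfit model looks better on the data it was trained on...
train_err_overfit = mse(overfit_coeffs, x_train, y_train)
train_err_simple = mse(simple_coeffs, x_train, y_train)

# ...but worse on new data drawn from the same underlying process.
x_new = np.linspace(0.05, 0.95, 50)
y_new = 2 * x_new
new_err_overfit = mse(overfit_coeffs, x_new, y_new)
new_err_simple = mse(simple_coeffs, x_new, y_new)
```

The lower training error of the flexible model is exactly the trap: it memorized the noise rather than learning a generalizable rule.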
Summary Statistics Review
For numerical or quantitative values, examining summary statistics might involve identifying the minimum, median, and maximum values as well as using statistical measures to determine the range within which the majority of values fall.
For categorical or class variables, it might entail determining the cardinality as well as what percentage of cases are in each class.
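Both kinds of review can be sketched in a few lines of pandas; the column names and values here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 31, 27, 45, 38, 29],                # numerical feature
    "city": ["London", "London", "New York",
             "Manchester", "New York", "London"],   # categorical feature
})

# Numerical: min, quartiles, median (the "50%" row), and max.
numeric_summary = df["age"].describe()

# Categorical: cardinality and the share of cases in each class.
cardinality = df["city"].nunique()
class_shares = df["city"].value_counts(normalize=True)
```

`describe()` also reports the mean and standard deviation, which give a first sense of where the bulk of the values sits.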
Data visualization is the act of charting and plotting data points to identify trends and how variables relate to each other. The aim is to see the shape and range of the data and to identify outliers.
Data scientists create visualizations to figure out what trends or patterns exist in the dataset. Much of this work is exploratory. It might be useful to include in an FOI request the data visualizations produced during this exploratory phase. This is a stage where an analyst has to be very careful not to make assumptions about the data. Visualizations can take the form of a bar graph, box plot, line chart, histogram, scatterplot, bubble chart, gauge, map, heat map, frame diagram, and so on.
What additional value does visualization add to the statistical methods outlined above?
Data visualization activates our intuition for processing data during the machine learning process, and it helps the analyst verify that the data a model learns from actually reflects reality.
Data visualization tools enable data scientists to identify and focus on the most useful data. It allows trends and patterns to be easily recognized, like noticing changes over time, examining a network, and determining the frequency of data points and relationships between values.
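Two of the most common exploratory plots are the histogram (shape and range of one variable) and the scatterplot (relationship between two variables). A minimal matplotlib sketch using synthetic data; the numbers and filename are illustrative only:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# A histogram shows the shape and range of a single variable.
values = rng.normal(loc=50, scale=10, size=200)  # hypothetical measurements
ax1.hist(values, bins=20)
ax1.set_title("Distribution of one variable")

# A scatterplot shows how two variables relate to each other.
x = rng.uniform(0, 10, 100)
y = 3 * x + rng.normal(0, 2, 100)  # roughly linear relationship plus noise
ax2.scatter(x, y)
ax2.set_title("Relationship between two variables")

fig.tight_layout()
fig.savefig("exploration.png")
```

A glance at plots like these can surface skew, gaps, or outliers that a table of summary statistics makes easy to miss.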
Good exploratory data analysis combined with relevant data visualization is essential for recognizing the best way forward. Equipped with the insights gathered from this exploratory analysis, the data scientist can now fully move into the next stage of the machine learning process. It is known by many names: data preprocessing, data transformation and data reduction, feature engineering, and data mining.
If you notice anything incorrect or missing from our explanations, please let us know through the contact form! We want our content to be as accurate and useful as possible — and you can help us do that.