Synthetic Data for Machine Learning

Much has been written in recent years on built-in biases in training data that translate into discriminatory machine learning algorithms: from racial biases in predictive policing software and facial recognition applications to higher insurance premiums for customers from low-income areas and discriminatory application-parsing software.

As a remedy to these problems, the tech industry has started using synthetic data as an alternative to, or more precisely a complement to, real datasets for training machine learning algorithms.

Unlike real data, whose biases are the result of historical trajectories and cultural and political realities, synthetic data allows for purposeful design. With strides in machine learning capabilities, especially in image generation using generative adversarial networks (GANs) and, more recently, diffusion models, synthetic data is itself being produced by machine learning algorithms.

This technique allows the production of vast synthetic datasets whose exact parameters can be controlled whilst mimicking real data. Whilst the initial development might be costly and time-consuming, once a model is established, scaling up synthetic data is quicker and cheaper than the tedious process of real data collection, cleaning and labeling (a process that has produced a thriving industry living off underpaid gig workers sifting through these datasets).
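To make the mechanics concrete, here is a minimal sketch of the idea, with a simple Gaussian mixture from scikit-learn standing in for the GANs or diffusion models used in practice; the dataset and the model choice are illustrative assumptions, not a description of any particular product:

```python
# Minimal sketch: fit a simple generative model to "real" data, then sample
# synthetic records from it. Real pipelines use GANs or diffusion models;
# a Gaussian mixture stands in here for brevity.
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

real_X, _ = load_iris(return_X_y=True)   # stand-in for a real, collected dataset

model = GaussianMixture(n_components=3, random_state=0).fit(real_X)

# Once the model is fitted, producing more data is cheap: scaling up is a
# single call rather than another round of collection, cleaning and labeling.
synthetic_X, _ = model.sample(n_samples=10_000)
print(synthetic_X.shape)                 # (10000, 4)
```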

Unsurprisingly, proponents argue that synthetic data can counter biases and their reproduction through machine learning algorithms: if you use unbiased synthetic data, it won't discriminate.
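What "unbiased" means here is usually a property of construction. A hypothetical sketch of the approach: fit a simple generator per demographic group and draw equal numbers of synthetic records from each, so the training set no longer reflects a historical skew. The column names, the two groups and the skew below are invented for illustration, not taken from any real system:

```python
# Hypothetical illustration: sample a synthetic applicant pool so that a
# sensitive attribute is balanced by construction. All names and numbers
# are invented for this sketch.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real = pd.DataFrame({
    "years_experience": rng.normal(8, 3, 1000).clip(0),
    "test_score": rng.normal(70, 10, 1000),
    "group": rng.choice(["A", "B"], size=1000, p=[0.8, 0.2]),  # skewed "real" data
})

# Fit one simple generator per group, then draw equal numbers from each,
# so the synthetic training set is balanced by design.
parts = []
for group, subset in real.groupby("group"):
    gm = GaussianMixture(n_components=2, random_state=0).fit(
        subset[["years_experience", "test_score"]]
    )
    samples, _ = gm.sample(n_samples=500)
    part = pd.DataFrame(samples, columns=["years_experience", "test_score"])
    part["group"] = group
    parts.append(part)

synthetic = pd.concat(parts, ignore_index=True)
print(synthetic["group"].value_counts())  # 500 records per group by design
```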

While this logic is impeccable from a technical point of view, it opens up a set of fundamental questions on the role of data in machine learning and its wider impact on society.

Starting with the obvious: synthetic data is not grounded in reality but engineered. It is certainly designed with an eye on reality, but also with the intent to mitigate existing biases in real data for the sake of better machine learning output, whether that is better cancer diagnostics, self-driving cars or facial recognition software.

While such endeavors are laudable, they also point to a dearly needed discussion on epistemology in machine learning. In the face of an increasing number of start-ups specializing in synthetic data production, it is important to have a parallel debate on its real-world effects across different applications: while using it to train self-driving systems on so-called edge cases or to improve cancer analytics seems rather unproblematic, artificially generated applicant profiles meant to inject greater diversity into training sets for application-parsing software are more sensitive.

Of course, correcting past wrongs and counteracting discriminatory employment practices is good, period. But is the introduction of synthetic data to this end the right way to tackle the problem? It will certainly alter machine learning algorithms and perhaps hiring practices, leading to more equitable employment.

But will it also prompt a rethinking among decision-makers who act on preconceived stereotypes, consciously or subconsciously? Looked at this way, we change practices but not minds. That is not necessarily bad, but it is also not the sustainable solution to political and societal problems that tech enthusiasts sometimes claim it to be.

Sticking with this example, a related question concerns the way synthetic data itself is manufactured. When it comes to applicant profiles, it seems that we are not solving the issue but merely transferring it from one machine learning algorithm to another: from the HR software to the machine learning applications that produce the synthetic training data feeding it.

In fact, using synthetic data to train machine learning algorithms might further complicate our relationship with autonomous decision-making systems and their data basis. Already today, we struggle to comprehend the algorithmic processes that generate certain output (I consciously do not use the term "decision"). The best example is ChatGPT, whose training data is not disclosed by OpenAI, the company behind it. Given that this training data involves large troves of text from the internet, rest assured that some of it is factually incorrect with regard to real events.

If we apply this logic to applications running (partly) on synthetic data, possibly without disclosure, it will be difficult to assess the output vis-à-vis reality: is the decision-making of employment software now based on synthetic data, or is it the result of past practices? Are self-driving cars now safer because all edge cases are taken into account, or does this make us overconfident about traffic safety, ruling out remaining risks prematurely?

To answer these questions, and to remain in control of the extent to which synthetic data not only helps optimize current machine learning algorithms but, in doing so, also reshapes reality without reference to real data, we must know where it is used and how.

We therefore need to carefully weigh the advantages and risks of synthetic data for each application. It is true that synthetic data can help reduce costs in the data production and mining process (a sector that employs thousands of people on minimal wages, among them refugees), improve data privacy and, of course, prevent machine learning algorithms from reproducing social and political discrimination.

But it is also a fact that meddling with datasets that originally sought to represent the real world in one way or another, by introducing synthetic alternatives, has a fundamental impact on how we think about and imagine the role of technology, and especially of autonomous decision-making systems, in society.

We need to have a more substantial discussion on synthetic data beyond mere market considerations and platitudes about remedying social wrongs through technical engineering.

