"Regenerating data to reflect reality not as it is but as we would like to see it"

This quote comes from a response by Mostly AI, an Austrian synthetic data generation company, to a 2021 article by Karen Hao in the MIT Technology Review.

In her article, Hao discusses the then relatively recent surge in synthetic data generation by a handful of companies seeking to satisfy the growing data hunger of big tech. Coinciding with increasingly strict data regulation spearheaded by the European Union (EU), data privacy had become a pressing concern.

Against this backdrop, the prospect of synthetic data generation was enticing. Not only would it preempt privacy concerns, it would also offer a cheaper and cleaner alternative to the often arduous and messy processes of data collection and cleaning.

What is more, in the face of discriminatory algorithms, synthetic data promised to mitigate biases and bring greater diversity to the training of machine learning algorithms. As a result, predictive algorithms would not simply replicate biases but be trained 'on the reality as we would like to see it'.

The irony of such a statement is obvious. Although synthetic data indeed offers a way out of the vicious cycle in which algorithms trained on biased datasets reproduce discriminatory practices, the question remains: who decides which biases to mitigate, and how?

Framed in this way, it becomes clear that the generation of synthetic data is not merely a technical exercise but a political question. Envisioning a more diverse and fairer reality and generating datasets accordingly is certainly laudable. But it is a discussion that should involve not only business executives, software engineers and data scientists but also the public at large, end customers and, depending on the sensitivity of the specific use case, political decision-makers.