
Deconstructing Generative Adversarial Networks and synthetic data

  • Jelani Harper 

The utility derived from the Generative Adversarial Network (GAN) approach to advanced machine learning is less celebrated than that of its language model counterparts. GANs have not consistently dominated media headlines for the last couple of years, and most deployments don’t involve reading massive quantities of written information to provide synopses or detailed answers to questions about it.

Yet, for the ever-expanding set of use cases that rely on synthetic data, GANs are just as vital as, if not more vital than, their language model counterparts are for semantic search, chatbots, and other common generative Artificial Intelligence applications.

GANs power a wealth of applications across data science in particular: they overcome the scarcity of training data for specific domains, mitigate model bias, test models for bias, and test models in general before they’re put into production. Consequently, they’re viable resources for ensuring everything from fair lending practices in finance to public sector policies that achieve their desired objectives without marginalizing sections of the population.

According to Brett Wujek, Principal Data Scientist of the Artificial Intelligence and Machine Learning Division of SAS Research and Development, “The latest sophisticated algorithms over the last five years or so for synthetic data are all around GAN technology.” Understanding how GANs operate, their applicability to transfer learning, and their implications for time-series data and other use cases is pivotal to maximizing the value of the synthetic data they provide across industries.

GAN architecture

The most widespread synthetic data application is likely providing a substitute for data containing sensitive information, such as Personally Identifiable Information, or data with strict requirements for privacy or regulatory compliance. For these use cases, GANs supply datasets that are statistically identical to the originals yet lack any sensitive information. As such, they’re desirable for facets of data governance, data security, data cataloging, and more. The three principal components of GAN architecture are a pair of neural networks, a generator and a discriminator, plus ‘real’ training data, which may include sensitive information.

According to Wujek, “You start with some noise feeding into [the] generator: just random data that makes no sense.” The generator produces values from that data, which the discriminator evaluates to see whether they resemble those of the training dataset. The generator’s objective is to reach a state in which it can fool the discriminator into thinking the generated values are the same as those of the training data. The discriminator’s objective is to determine the difference between those values.
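For readers who prefer code, here is a minimal sketch of that two-network setup in PyTorch. The layer sizes, noise dimension, and tabular data shape are illustrative assumptions, not details of any SAS implementation.

```python
import torch
import torch.nn as nn

NOISE_DIM = 32   # dimension of the random noise fed to the generator (assumed)
DATA_DIM = 8     # number of columns in a hypothetical tabular training dataset

# Generator: maps random noise to a candidate synthetic record
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, DATA_DIM),
)

# Discriminator: scores how likely a record is to be real rather than generated
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

noise = torch.randn(16, NOISE_DIM)       # "random data that makes no sense"
fake_records = generator(noise)          # candidate synthetic records
realness = discriminator(fake_records)   # discriminator's verdict per record
```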

Transfer learning utility

Based in part on the discriminator’s responses, the generator learns to refine its values to make them more similar to those of the training data. “Initially, the discriminator’s going to very easily distinguish between the two,” Wujek admitted. “But, it’s an iterative process. As the generator learns how to tweak its internal settings, the different weights in the internal neural networks, if you will, it generates data that is more and more like the real data and, ultimately, the discriminator can’t tell the difference between the two.”

Once the discriminator can’t differentiate the generator’s values from those of the real data, credible synthetic data has been produced. The weights Wujek mentioned are critical not only for achieving this objective, but also for applying the model to other synthetic data use cases that are more efficient and cost-effective than the initial one. “The initial generations of GAN training are the most expensive,” Wujek revealed. “Once you’ve settled on a set of weights that make sense, those can be used as a starting point for training other models.”
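Continuing the earlier sketch, the iterative process Wujek describes alternates between two weight updates: the discriminator learns to separate real records from generated ones, and the generator learns to fool it. The loss, optimizer, batch size, and step count below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Reuses generator, discriminator, NOISE_DIM, and DATA_DIM from the sketch above.
# Stand-in for real training data: 1,000 records with DATA_DIM columns (assumed).
real_data = torch.randn(1000, DATA_DIM)

bce = nn.BCELoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

for step in range(5000):
    # Discriminator update: learn to tell real records from generated ones
    real_batch = real_data[torch.randint(0, len(real_data), (64,))]
    fake_batch = generator(torch.randn(64, NOISE_DIM)).detach()
    d_loss = (bce(discriminator(real_batch), torch.ones(64, 1)) +
              bce(discriminator(fake_batch), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: tweak weights so fakes get labeled as real
    fake_batch = generator(torch.randn(64, NOISE_DIM))
    g_loss = bce(discriminator(fake_batch), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# The expensive-to-learn weights can be saved as a starting point for other models.
torch.save(generator.state_dict(), "mortgage_gan_generator.pt")  # hypothetical file name
```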

Transfer learning techniques profit from the work already done in successfully generating synthetic data: they carry the parameters and dimensions of one GAN over to another. For example, the weights of a GAN that created synthetic mortgage data can accelerate a GAN for creating synthetic credit card lending data. Such transfer learning applications “really expedite the training process and gives you a huge jump start,” Wujek confirmed.
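A sketch of that jump start, again with assumed names, shapes, and file paths: the new GAN’s generator is initialized from the saved mortgage weights rather than from random values, then fine-tuned on the new domain’s data. This assumes the two generators share a compatible architecture.

```python
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 32, 8  # must match the saved mortgage model (assumed)

def make_generator():
    # Same architecture as the mortgage GAN's generator; transfer learning
    # in this form requires the layer shapes to line up.
    return nn.Sequential(
        nn.Linear(NOISE_DIM, 64),
        nn.ReLU(),
        nn.Linear(64, DATA_DIM),
    )

# Warm start: load the mortgage GAN's weights into the credit card GAN's
# generator, then continue training on credit card lending data.
credit_card_generator = make_generator()
credit_card_generator.load_state_dict(torch.load("mortgage_gan_generator.pt"))
```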

Time-series analysis and more

GANs take their name from the adversarial relationship between the generator and the discriminator, which effectively work together by trying to outsmart one another, in a manner not entirely dissimilar from opponents in a video game. According to Wujek, this architecture is “supervised learning, but it’s somewhat semi-supervised, I would say.”

Today, there are numerous types of GANs with different forms of specialization. These models can be applied to image data, relational data, and even time-series data. “It’s such a sophisticated and effective model architecture that people are using it as a starting point to build new architectures off of,” Wujek commented. “For time-series data that needs to account for an appropriate sequence of data, data that makes sense in a sequential manner, there’s time-series GANs that account for that.”
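As one illustration of how a time-series variant can account for sequence, the sketch below swaps the feed-forward generator for a recurrent one, so each generated step conditions on the steps before it. This is a deliberately simplified stand-in, not the full architecture of published time-series GANs such as TimeGAN, which add embedding networks and supervised losses; all dimensions here are assumptions.

```python
import torch
import torch.nn as nn

SEQ_LEN, NOISE_DIM, FEATURES = 24, 16, 4  # illustrative sequence/feature sizes

class SequenceGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # The GRU carries hidden state from step to step, preserving order
        self.rnn = nn.GRU(NOISE_DIM, 32, batch_first=True)
        self.out = nn.Linear(32, FEATURES)

    def forward(self, z):
        h, _ = self.rnn(z)   # each step conditions on the steps before it
        return self.out(h)   # one synthetic observation per time step

z = torch.randn(8, SEQ_LEN, NOISE_DIM)     # noise for each time step
synthetic_series = SequenceGenerator()(z)  # shape: (8, SEQ_LEN, FEATURES)
```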
