After playing with GPT for some time, testing GenAI vendor solutions, designing my own, and reading feedback from other users, I uncovered a number of problems. Here I share some of the most common issues and how to address them. These problems impact LLMs and synthetic data generation the most, including time series generation.
1. Poor Evaluation Metrics
In computer vision, it is easy to visually assess the quality of an image. Not so with tabular data. Poor evaluation metrics may result in poor or unrealistic synthetizations. You cannot capture the complex dependencies among features with one- or two-dimensional metrics. In addition, some features may be categorical or text, some numerical, some bin counts. Currently, quality measurements rely on pairwise feature comparisons or on blending univariate statistical summaries. A full, true multivariate comparison (real versus synthetic) is difficult to implement. It is now available, with an open-source library for synthetic data: see here.
More to come soon for LLMs. There is no longer any excuse for failing to capture complex multivariate patterns. Poor synthetic data is now easy to detect, and should be a thing of the past.
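For readers who want a feel for what a multivariate comparison involves, here is a minimal sketch in Python. It is not the author's open-source library: it simply evaluates the empirical multivariate CDF of the real and synthetic data at a set of anchor points and reports the largest discrepancy, a Kolmogorov-Smirnov-style statistic. The function name and parameters are illustrative only.

```python
# Minimal sketch (not the author's library): a crude multivariate comparison
# between real and synthetic tabular data, using the empirical multivariate
# CDF evaluated at anchor points drawn from the real data.
import numpy as np

def multivariate_ecdf_distance(real, synth, n_anchors=1000, seed=42):
    """Max absolute gap between the multivariate ECDFs of two datasets,
    evaluated at anchor points sampled from the real data."""
    rng = np.random.default_rng(seed)
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    anchors = real[rng.integers(0, len(real), size=n_anchors)]
    # ECDF at anchor z: fraction of rows that are <= z in every feature.
    ecdf_real = np.mean(np.all(real[None, :, :] <= anchors[:, None, :], axis=2), axis=1)
    ecdf_synth = np.mean(np.all(synth[None, :, :] <= anchors[:, None, :], axis=2), axis=1)
    return np.max(np.abs(ecdf_real - ecdf_synth))

# Example: a well-matched distribution scores near 0; a shifted copy scores higher.
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))
good = rng.normal(size=(2000, 4))
bad = rng.normal(loc=0.5, size=(2000, 4))
print(multivariate_ecdf_distance(real, good))  # small
print(multivariate_ecdf_distance(real, bad))   # noticeably larger
```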
2. Inability to Sample Outside the Training Set
In one of my recent datasets (health insurance), annual charges per policyholder ranged from $1,000 to $65,000. Of all the vendors I tested, including open-source solutions, none was able to generate synthetic values outside that range. What’s more, the same was true for all the other features, no matter how many observations you synthesize. Worse, it was true for all the datasets. This is not an issue specific to generative adversarial networks (GANs); in fact, my NoGAN had the same problem. You may then ask: what should the limits be, and how can you even synthesize outside the observation range without using bigger training sets?
Now there is an answer to these questions: see my new article, here. And it is a lot simpler than the diffusion models used in computer vision. Using cross-validation, you can even test whether the maximum should be $70,000 or $250,000, depending on the number of generated observations. Or you can incorporate business rules that cap the minimum and maximum for specific features. All of this is fine-tuned with a simple hyperparameter. You can now generate much richer, more realistic data, including naturally occurring outliers. The method also works well with small datasets.
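The article linked above describes my method in detail; the snippet below is only a toy univariate illustration of the general idea, not that algorithm. It extends the empirical quantile function beyond the observed minimum and maximum, with illustrative hyperparameters (`stretch`, `tail_mass`) and optional floor/cap arguments standing in for business rules.

```python
# Toy sketch (not NoGAN or the article's method): let a univariate sampler
# produce values outside the observed range, via inverse-transform sampling
# from a quantile function with extended endpoints.
import numpy as np

def sample_with_extended_tails(x, n, stretch=0.10, tail_mass=0.02,
                               floor=None, cap=None, seed=0):
    """Sample n values; endpoints are pushed out by `stretch` * (observed range),
    and `tail_mass` is the probability of landing in each extended tail."""
    rng = np.random.default_rng(seed)
    x = np.sort(np.asarray(x, dtype=float))
    span = x[-1] - x[0]
    quantiles = np.concatenate(([x[0] - stretch * span], x, [x[-1] + stretch * span]))
    probs = np.concatenate(([0.0], np.linspace(tail_mass, 1.0 - tail_mass, len(x)), [1.0]))
    samples = np.interp(rng.random(n), probs, quantiles)
    if floor is not None:
        samples = np.maximum(samples, floor)   # business-rule minimum
    if cap is not None:
        samples = np.minimum(samples, cap)     # business-rule maximum
    return samples

# Example: observed charges span $1,000 to $65,000; the synthetic sample can
# go about 15% beyond that range, but never below 0.
charges = np.random.default_rng(1).uniform(1_000, 65_000, size=5_000)
synth = sample_with_extended_tails(charges, n=10_000, stretch=0.15, tail_mass=0.02, floor=0)
print(charges.min(), charges.max())   # observed range
print(synth.min(), synth.max())       # extends below and above that range
```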
3. New Datasets Require New Hyperparameters
New datasets also require a lot of retraining and preprocessing, involving both human and computing time. In short, it is expensive. What if there were an algorithm that works a lot faster, with auto-tuning and explainable AI? That is, a robust algorithm you can rely on without significant (if any) onboarding for each new dataset? In the context of synthetic data, there is one: NoGAN. Actually, there is more than one, but NoGAN now has its own Python library. Since the ideas behind NoGAN originate from NLP, you can expect a version for LLMs in the near future. If you want to start light with synthetic data, using a free implementation that outperforms everything else, runs 1000x faster than deep neural networks, and comes with the best evaluation metric, see my recent presentation, here.
I am currently developing a Web API where you can upload your dataset and have it synthesized. Since it’s free, my incentives are aligned with those of the user (cost optimization), not with those of cloud companies charging by bandwidth. Yet, to facilitate adoption, I also focused on high quality and ease of use. It is under construction, on GenAItechLab.
4. Poor but Expensive Training
These days, the tendency is towards bigger and bigger training sets. For AI engineers, it is the easiest way to overcome many problems. It is also the most expensive. Yet, I showed instances where randomly erasing 50% of your training data had no impact on performance. For simple algorithms such as linear regression, a 90% reduction resulted in improved predictions, thanks to reduced overfitting: see here. Then, as discussed in section 2, it is possible to sample outside the observation range even with small training sets. And for LLMs, customized solutions usually outperform generic versions. The idea is to use only carefully selected inputs most relevant to the output. Content taxonomy should be part of the equation. Automatically choosing what to crawl and what to skip in specific repositories (Stack Exchange, GitHub) may reduce your computation time by a factor of 10.
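As a quick illustration of the subsampling point for simple models, the self-contained experiment below (on simulated data, not my benchmark from the linked article) trains linear regression on the full training set and on 50% and 10% random subsamples, then compares holdout error. On well-specified data of this size, the gap is typically negligible; results will of course vary with real datasets.

```python
# Illustrative experiment: how much does aggressive subsampling of the
# training set hurt a simple model? Often, very little.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=1.0, size=20_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for frac in (1.0, 0.5, 0.1):
    n = int(frac * len(X_train))
    idx = rng.choice(len(X_train), size=n, replace=False)
    model = LinearRegression().fit(X_train[idx], y_train[idx])
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"train fraction {frac:>4}: holdout MSE = {mse:.4f}")
```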
One of the reasons behind inflated training sets is the instability of deep neural networks such as GANs. Using a fixed number of epochs, however large, does not fix the problem. Different seeds can lead to very different results: the quality of the output depends on how close your initial configuration is to a decent local minimum of the loss function, and this is determined by the seed. Alternatives such as NoGAN, and especially NoGAN2, start with a very good approximation, use a better loss function, and require much less computing time. In short, they reduce both convergence issues and cost.
5. Lack of Replicability
In the previous section, I mentioned the concept of seed. In short, a seed is an integer that initializes all the random number generators used in your algorithm, whether a deep neural network or anything else relying on random numbers. Because most implementations are created by engineers rather than scientists, seeds are not part of the hyperparameters. Thus, running the same algorithm twice leads to different results, making it impossible to replicate a great synthetization. It is a fact of life with current GenAI techniques, and one that is rarely even discussed. By contrast, all my GenAI algorithms are replicable. Yet, it would be rather easy to make GAN models replicable: I did it (just use the same seed, assuming you have one to begin with). It is something that was overlooked by developers and vendors alike at the design stage. Note that replicability is a lot more difficult to achieve in GPU implementations.
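Here is a minimal sketch of what treating the seed as an explicit hyperparameter looks like in practice. The synthesizer below is a toy stand-in, not any particular vendor's model; deep learning frameworks expose analogous calls (for example torch.manual_seed), and, as noted above, fully deterministic GPU runs require additional settings.

```python
# Minimal sketch: expose the seed as a hyperparameter so the same
# synthetization can be reproduced exactly.
import random
import numpy as np

def synthesize(data, n_obs, seed=42):
    """Toy synthesizer: all randomness flows through generators initialized
    from `seed`, so identical seeds give identical output."""
    random.seed(seed)                      # Python's built-in RNG
    rng = np.random.default_rng(seed)      # NumPy RNG
    idx = rng.integers(0, len(data), size=n_obs)
    noise = rng.normal(scale=0.01, size=(n_obs, data.shape[1]))
    return data[idx] + noise

data = np.random.default_rng(0).normal(size=(1_000, 3))
run1 = synthesize(data, 500, seed=7)
run2 = synthesize(data, 500, seed=7)
print(np.array_equal(run1, run2))  # True: same seed, same synthetic data
```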
Author
Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com, former VC-funded executive, author, and patent owner, with one patent related to LLMs. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET.
Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS). He has published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is the author of multiple books, including “Synthetic Data and Generative AI” (Elsevier, 2024). Vincent lives in Washington state and enjoys doing research on stochastic processes, dynamical systems, experimental math, and probabilistic number theory. He recently launched a GenAI certification program, offering state-of-the-art, enterprise-grade projects to participants.