Since starting my own AI / machine learning research lab over a year ago, I have published 24 technical papers and 4 books, in addition to my articles on Data Science Central. Here I list the most popular ones, in no particular order, each with a short summary. The number attached to each paper corresponds to its entry index on my blog (in reverse chronological order), for easy retrieval. You can access them on my blog, here.
Feature Clustering
I discuss how to perform clustering (also called unsupervised classification) on the feature set, as opposed to the traditional clustering of observations. I use cross-correlations between features as a similarity metric, retaining one representative feature per feature group. The goal is to reduce the number of features with minimal loss of predictive power. The method is an alternative to PCA, but without combining the features into meaningless predictors. Finally, I apply it to synthetic data generation. See article #21, and the related article #23 about randomly deleting up to 80% of the observations without loss in predictive power.
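To make the idea concrete, here is a minimal sketch of feature clustering via cross-correlations, assuming hierarchical clustering with average linkage and a distance threshold; these choices are illustrative and not necessarily those of article #21.

```python
# Minimal sketch: group correlated features, keep one representative per group.
# Linkage method and threshold are assumptions, not the article's exact algorithm.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_features(X, threshold=0.3):
    """X: (n_obs, n_features) array. Returns representative feature indices."""
    corr = np.corrcoef(X, rowvar=False)        # cross-correlations between features
    dist = 1.0 - np.abs(corr)                  # turn similarity into a distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=threshold, criterion="distance")
    reps = []
    for k in np.unique(labels):
        members = np.where(labels == k)[0]
        # keep the feature most correlated, on average, with its own group
        avg_corr = np.abs(corr[np.ix_(members, members)]).mean(axis=1)
        reps.append(members[np.argmax(avg_corr)])
    return sorted(reps), labels
```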
Data Synthetization: GANs vs Copulas
Synthetic data has many applications. Companies use it for data augmentation to enhance existing training sets, to rebalance datasets (fraud detection, clinical trials), or to reduce algorithmic bias. In my case, I use it to benchmark various algorithms. Case studies in my article include insurance and healthcare datasets. The focus is on tabular data, replicability, and fine-tuning the parameters for optimum performance. I discuss two methods in detail, generative adversarial networks (GANs) and copulas, showing when GANs lead to superior results. I also feature this material in my classes and in my book on the topic. See article #20.
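For readers who want to experiment, below is a hedged sketch of the copula approach on numeric tabular data. The Gaussian copula family and the rank-based empirical CDF mapping are my assumptions for illustration, not necessarily the exact setup of article #20.

```python
# Gaussian-copula synthesizer sketch for numeric tabular data (illustrative only).
import numpy as np
from scipy.stats import norm

def copula_synthesize(X, n_samples, rng=None):
    """X: (n_obs, n_features) real-valued data. Returns synthetic rows."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # 1. map each feature to uniforms via its empirical CDF (double argsort = ranks)
    U = (np.argsort(np.argsort(X, axis=0), axis=0) + 0.5) / n
    # 2. map uniforms to standard normals and estimate their correlation
    Z = norm.ppf(U)
    R = np.corrcoef(Z, rowvar=False)
    # 3. sample correlated normals, push them back to uniforms
    Zs = rng.multivariate_normal(np.zeros(d), R, size=n_samples)
    Us = norm.cdf(Zs)
    # 4. invert each empirical CDF by quantile lookup on the original data
    return np.column_stack([np.quantile(X[:, j], Us[:, j]) for j in range(d)])
```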
Math-free Gradient Descent in Python
Almost all machine learning algorithms rely on some optimization technique to find a good solution; this is what all neural networks do in the end. The generic name for these techniques is gradient descent, and it comes in all sorts of flavors, such as stochastic gradient descent or swarm optimization. In my article, I discuss a very intuitive approach that replicates the path a raindrop follows from hitting the ground uphill to reaching the valley floor. There is no learning rate, no parameter, and the technique is math-free; thus it applies to raw data even in the absence of a mathematical function to minimize. In addition to contour levels, I show how to compute orthogonal trajectories and process hundreds of starting points at once, leading to cool videos. See article #17, and related article #23, where I discuss a smart grid-search algorithm for hyperparameter optimization.
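The spirit of the technique fits in a few lines: treat the data as a grid of heights and let each drop move to its lowest neighboring cell. The 8-cell neighborhood and stopping rule below are illustrative assumptions, not the exact algorithm of article #17.

```python
# "Raindrop" descent sketch: no learning rate, no formula, only observed heights.
import numpy as np

def raindrop_descent(grid, start):
    """grid: 2D array of heights; start: (row, col). Returns the drop's path."""
    path = [start]
    r, c = start
    while True:
        neighbors = [(r + dr, c + dc)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0)
                     and 0 <= r + dr < grid.shape[0]
                     and 0 <= c + dc < grid.shape[1]]
        best = min(neighbors, key=lambda rc: grid[rc])
        if grid[best] >= grid[r, c]:       # no lower neighbor: valley floor reached
            return path
        r, c = best
        path.append(best)

# Example on a bowl-shaped surface; hundreds of starting points can be looped over.
y, x = np.mgrid[-5:5:0.1, -5:5:0.1]
heights = x**2 + y**2
print(raindrop_descent(heights, (5, 5))[-1])   # ends near the center of the bowl
```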
Cloud Regression: The Swiss Army Knife of Optimization
In this article, I describe a new method that blends all regression-related techniques under the same umbrella: linear, logistic, Lasso, ridge, and more. For logistic regression, I show how to replace the logit function with a parameter-free, data-driven version. The method is first defined for datasets with no response, where all features play the same role. In that case, the dataset is just a cloud of points, hence the name of the technique, and the result is unsupervised regression. The method can also solve classification problems, in some cases with a closed-form formula and without assigning points to clusters. The solution involves Lagrange multipliers and a gradient descent approach. It is free of statistical jargon, yet allows you to compute confidence intervals for predicted values, using a concept of confidence level more intuitive to non-statisticians. All standard regression techniques and curve fitting are just particular cases. See article #10, and related article #16 on multivariate interpolation; the latter describes a hybrid additive-multiplicative algorithm that gets the best of both.
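To illustrate the "cloud of points" setting where no feature plays the role of a response, here is a sketch of a hyperplane fit minimizing orthogonal (rather than vertical) distances. The Lagrange-multiplier condition for the unit normal reduces to an eigenvalue problem; this closed-form shortcut is a standard result used purely for illustration, not the general algorithm of article #10.

```python
# Orthogonal (total least-squares) fit to a point cloud: all features symmetric.
import numpy as np

def orthogonal_fit(X):
    """X: (n_obs, d) cloud. Returns (a, b) with ||a|| = 1, defining the
    hyperplane a . x = b that minimizes summed squared orthogonal distances."""
    mu = X.mean(axis=0)
    C = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending eigenvalues
    a = eigvecs[:, 0]                      # smallest one gives the normal direction
    return a, a @ mu

rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
cloud = np.hstack([t, 2 * t]) + 0.1 * rng.normal(size=(200, 2))
a, b = orthogonal_fit(cloud)
print(a, b)   # normal roughly proportional to (2, -1): the line y = 2x
```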
Gentle Introduction to Linear Algebra
This was my first article on the topic, featuring a new, simple approach to solving linear and matrix algebra problems relevant to AI and machine learning. The focus is on simplicity, offering beginners overwhelmed by mathematics a light presentation, without watering down the content. Quite the contrary: I go as far as discussing doubly-integrated Brownian motions and autoregressive processes, but without eigenvalues or jargon. You will also discover unusual continuous time series that fill entire domains. The goal is to share new models with potential applications, for instance in Fintech. See article #5. Once you have mastered this content, you can move on to the more advanced articles #18 and #19, which deal with chaotic dynamical systems.
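As a taste of the material, the following few lines simulate a doubly-integrated Brownian motion by discretizing B(t) into Gaussian increments and integrating twice with cumulative sums; the step size and horizon are arbitrary illustrative choices.

```python
# Simulate B(t) and its double integral via cumulative sums (Euler discretization).
import numpy as np
import matplotlib.pyplot as plt

n, T = 10_000, 1.0
dt = T / n
rng = np.random.default_rng(42)
dB = rng.normal(0.0, np.sqrt(dt), size=n)
B = np.cumsum(dB)              # Brownian motion B(t)
I1 = np.cumsum(B) * dt         # first integral of B
I2 = np.cumsum(I1) * dt        # doubly-integrated Brownian motion
t = np.linspace(dt, T, n)
plt.plot(t, B, label="B(t)")
plt.plot(t, I2, label="doubly integrated B(t)")
plt.legend()
plt.show()
```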
Simple Alternative to XGBoost
While working at Visa on credit card fraud detection, I designed binning techniques to process combined features using a large number of small, overlapping decision trees. The methodology was originally developed around 2002, was later extended to deal with natural language processing problems, and has been constantly improved ever since. This ensemble method was invented independently of the team that created XGBoost, so you are likely to discover some original tricks, best practices, and rules of thumb in my article, especially in terms of algorithm simplification. In particular, it blends a simplified logistic regression with many decision trees. As in my other articles, it comes with a Python implementation, also available on GitHub. I discuss a use case: selecting and optimizing article titles for maximum impact. It involves many of the ingredients that you will find in applications such as ChatGPT. See article #11.
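The sketch below conveys the flavor of the approach: many small, overlapping trees, each fitted on a tiny subset of features, with averaged scores. The tree depth, subset size, and use of scikit-learn are my assumptions; the actual method in article #11 also blends in a simplified logistic regression.

```python
# Ensemble of many small, overlapping decision trees on random feature pairs.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_small_tree_ensemble(X, y, n_trees=200, subset_size=2, rng=None):
    rng = np.random.default_rng(rng)
    ensemble = []
    for _ in range(n_trees):
        cols = rng.choice(X.shape[1], size=subset_size, replace=False)
        tree = DecisionTreeClassifier(max_depth=2).fit(X[:, cols], y)
        ensemble.append((cols, tree))
    return ensemble

def predict_proba(ensemble, X):
    # average the positive-class score over all the small trees
    scores = [tree.predict_proba(X[:, cols])[:, 1] for cols, tree in ensemble]
    return np.mean(scores, axis=0)
```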
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, founder of MLTechniques.com and co-founder of Data Science Central (acquired by TechTarget in 2020), a former VC-funded executive, author, and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, and InfoSpace. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS).
Vincent has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington State and enjoys doing research on stochastic processes, dynamical systems, experimental math, and probabilistic number theory.