We propose simple solutions to important problems that all data scientists face almost every day. In short, a toolbox for the handyman, useful to busy professionals in any field.
1. Eliminating sample size effects. Many statistics, such as correlations or R-squared, depend on the sample size, making it difficult to compare values computed on two data sets of different sizes. This easy trick, based on re-sampling techniques, lets you compare apples with apples rather than with oranges. Read more here.
2. Sample size determination, and simple, model-free confidence intervals. We propose a generic methodology, also based on re-sampling techniques, to compute any confidence interval and to test hypotheses without using any statistical theory. It is also easy to implement, even in Excel. Read more here.
3. Determining the number of clusters in unsupervised clustering. This modern version of the elbow rule also tells you how strong the global optimum is, and can help you identify local optima too. It can also be automated. Read more here.
4. Fixing issues in regression models when the assumptions are violated. If your data has serial correlation, unequal variances, or other similar problems, this simple trick removes the issue and allows you to perform more meaningful regressions, or to detect flaws in your data set. Read more here.
5. Performing joins on poor-quality data. This 40-year-old trick allows you to perform a join when your data is riddled with typos, multiple names representing the same entity, and other similar issues. In short, it performs a fuzzy join. Read more here.
6. Scale-invariant techniques. Sometimes, transforming your data, even just changing the scale of one feature (say, from meters to feet), has a dramatic impact on the results. Often, you want your conclusions to be scale-independent. This trick solves that problem. Read more here.
7. Blending data sets with incompatible data, adding consistency to your metrics. We are all too familiar with metrics that change over time and result in inconsistencies when comparing the past to the present, or when comparing different segments with incompatible measurements. This trick will allow you to design systems where, again, apples are compared to apples, not to oranges. Read more here.
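To make trick 1 more concrete, here is a minimal Python sketch of one way re-sampling can put two data sets on an equal footing: draw many random subsamples of the same size from each, and compare the averaged statistic. This is only an illustration under my own assumptions, not necessarily the method described in the linked article. The helper names (`pearson`, `statistic_at_size`, `make_pairs`) and the synthetic data are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def statistic_at_size(pairs, size, stat=pearson, n_draws=200, seed=0):
    """Average `stat` over random subsamples (without replacement) of `size` rows."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_draws):
        xs, ys = zip(*rng.sample(pairs, size))
        values.append(stat(xs, ys))
    return sum(values) / len(values)

# Synthetic data, made up for illustration: same process, very different sizes.
_rng = random.Random(42)

def make_pairs(n):
    pairs = []
    for _ in range(n):
        x = _rng.gauss(0, 1)
        pairs.append((x, 0.5 * x + _rng.gauss(0, 1)))
    return pairs

small = make_pairs(50)
large = make_pairs(5000)

# Evaluate both data sets at the same sample size before comparing them.
common = min(len(small), len(large))
r_small = statistic_at_size(small, common)
r_large = statistic_at_size(large, common)
print(f"correlation at n={common}: small={r_small:.3f}, large={r_large:.3f}")
```

Any statistic that takes two sequences can be passed as `stat`; the point of the sketch is only that both values are computed at the same sample size.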
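For trick 2, a common re-sampling approach to model-free confidence intervals is the percentile bootstrap, sketched below in Python. I am assuming this as a stand-in; the linked article may use a different re-sampling scheme. The function name `bootstrap_ci`, its parameters, and the sample data are hypothetical.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat`, with no distributional assumptions."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    alpha = (1 - level) / 2
    lo = estimates[int(alpha * n_boot)]
    hi = estimates[min(n_boot - 1, int((1 - alpha) * n_boot))]
    return lo, hi

# Made-up sample, for illustration only.
_rng = random.Random(1)
data = [_rng.gauss(10, 2) for _ in range(100)]

lo, hi = bootstrap_ci(data)
print(f"95% interval for the mean: [{lo:.2f}, {hi:.2f}]")

# The same function works for other statistics, e.g. the median.
print(bootstrap_ci(data, stat=statistics.median))
```

Because the interval is read directly from the sorted resampled estimates, the same recipe applies to statistics that have no convenient closed-form standard error.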
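For trick 5, the sketch below shows one possible fuzzy join in Python, using the standard library's `difflib.get_close_matches` after a light normalization step. This is a swapped-in technique chosen for illustration; the article does not say which algorithm its 40-year-old trick refers to. The helpers `normalize` and `fuzzy_join`, the `cutoff` value, and the example names are assumptions.

```python
import difflib

def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())

def fuzzy_join(left_keys, right_keys, cutoff=0.8):
    """Pair each left key with its closest right key, or None if nothing is close enough."""
    right_by_norm = {normalize(k): k for k in right_keys}
    candidates = list(right_by_norm)
    joined = []
    for key in left_keys:
        match = difflib.get_close_matches(normalize(key), candidates, n=1, cutoff=cutoff)
        joined.append((key, right_by_norm[match[0]] if match else None))
    return joined

# Made-up keys containing a typo, an alternate spelling, and an unmatched entry.
left = ["Microsft Corp.", "I.B.M.", "Unknown LLC"]
right = ["Microsoft Corp", "IBM", "Oracle"]

result = fuzzy_join(left, right)
for left_key, right_key in result:
    print(f"{left_key!r} -> {right_key!r}")
```

The `cutoff` threshold trades false matches against missed matches, so in practice it would need tuning on a sample of known pairs.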
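For trick 6, one familiar way to make a feature insensitive to its unit of measurement is to standardize it (subtract the mean, divide by the standard deviation). The short Python sketch below checks that the standardized values are the same whether heights are recorded in meters or in feet. Standardization is my assumption here, used only to illustrate scale invariance; the linked article may propose a different technique. The data values are made up.

```python
import statistics

def standardize(values):
    """Z-scores: unchanged when every value is multiplied by a positive constant."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up heights, first in meters, then converted to feet.
heights_m = [1.60, 1.75, 1.82, 1.68, 1.90]
heights_ft = [h * 3.28084 for h in heights_m]

z_m = standardize(heights_m)
z_ft = standardize(heights_ft)

# The raw values differ by the conversion factor; the z-scores agree.
max_gap = max(abs(a - b) for a, b in zip(z_m, z_ft))
print(f"largest difference between z-scores: {max_gap:.2e}")
```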
To make sure you do not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn, or visit my old web page here.