How do you fix common data issues in a few simple steps? In this first part, I discuss missing, outdated, and unobserved data, data that is costly to produce, and dirty, unbalanced, and unstructured data. The second part covers biased, inconsistent, siloed, too-big or fast-flowing data, as well as security, privacy, and precision issues. It also addresses the case where features are too numerous (wide data) and issues related to high-dimensional data. Data leakage will be the topic of a separate article. To make sure you don't miss these future articles, sign up to receive the Data Science Central newsletter, here.
Missing Data
Missing data arises for various reasons: a survey with incomplete answers, or censored data. The latter occurs when you measure the lifetime of a component over a 3-year period but not all components fail within 3 years, and running the experiment longer would be too expensive. In this case, the fix is to use appropriate survival models, as actuaries do to build life expectancy tables. For truly missing data (the first case), techniques such as decision trees work well, while most regression techniques do not. One regression technique that handles this situation quite well is partial least squares (PLS).
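As a concrete illustration of the tree-based route, here is a minimal Python sketch using scikit-learn's histogram gradient boosting, which splits on missing values natively so you do not have to drop or impute rows first. The data is synthetic and the 20% missingness rate is purely illustrative.

```python
# Minimal sketch: tree-based models in scikit-learn handle NaN cells natively,
# so rows with missing values need not be dropped or imputed beforehand.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels computed before cells go missing
X[rng.random(X.shape) < 0.2] = np.nan     # knock out 20% of the cells (illustrative)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = HistGradientBoostingClassifier().fit(X_train, y_train)  # NaN-aware splits
print("accuracy with missing cells:", model.score(X_test, y_test))
```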
Unobserved Data
A good example is Covid infections and recoveries that occurred without a test and thus left no data trail. At the beginning of the pandemic, these cases were vastly undercounted because they were unobserved, yet they outnumbered problematic cases by a long shot, significantly biasing epidemiological models. To become aware of hidden, non-captured data, hire a neutral advisor who is very good at imagining all potential scenarios. Epidemiologists tend to be risk-averse, and statisticians do not always see the big picture, so you need the opinion of an educated non-expert who can think outside the box. Also, look for alternate data: in the case of Covid, sewage data may be helpful.
Expensive Data
A typical example is clinical trials. Some vendors specialize in helping companies design models for smaller, not bigger, data. The solution boils down to good experimental design and extracting the most out of small or modest data sets. Biostatistics models are a good starting point; many of them apply to contexts well beyond clinical trials.
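One standard ingredient of good experimental design is a power analysis: decide upfront how few observations you can get away with for the effect you care about. Below is a minimal sketch using statsmodels (assumed installed); the effect size, alpha, and power values are illustrative, not recommendations.

```python
# Minimal sketch: solve for the smallest per-group sample size that still
# detects a given effect in a two-sample t-test setting.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # medium effect (Cohen's d), illustrative
                                   alpha=0.05,       # significance level
                                   power=0.8)        # desired statistical power
print(f"Required sample size per group: {n_per_group:.0f}")
```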
Dirty Data
Your data set may contain duplicate records or duplicate IDs. If it is based on user input, it may contain erroneous fields, such as typos in zip codes, and some fields may not be properly coded. Or the data may be a blend from multiple sources, each with a different set of features, or with identical features measured differently and thus incompatible.
Several fixes help. Automate data capture: let the user select the zip code on your web form, or fill it in automatically based on the city. Create a data dictionary to detect the top values attached to each feature; for instance, an integer field might be set to 99999 or NaN, meaning the value is missing. A string containing special characters (such as a comma) may have been truncated during parsing: if it represents a URL, that URL is now wrong. Perform data reconciliation: see my patent on this topic, here. When parsing text data, use a robust parser, and make sure your engineers master regular expressions! Finally, look for outliers: these observations are not necessarily wrong, but they are always insightful. Much of this data exploration step should be automated.
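To show what an automated first pass might look like, here is a minimal pandas sketch that surfaces duplicate rows and IDs, sentinel values such as 99999 or NaN, and malformed zip codes. The file name and the column names ("user_id", "zip_code") are hypothetical placeholders for your own schema.

```python
# Minimal sketch: a quick "data dictionary" pass to surface dirty-data symptoms.
import pandas as pd

df = pd.read_csv("transactions.csv", dtype={"zip_code": str})  # hypothetical file/columns

# Duplicate records and duplicate IDs
print("duplicate rows:", df.duplicated().sum())
print("duplicate user_id:", df["user_id"].duplicated().sum())

# Top values per column: sentinel codes (99999, NaN) show up immediately
for col in df.columns:
    print(col, df[col].value_counts(dropna=False).head(5).to_dict())

# Simple field validation: US zip codes should be exactly 5 digits
bad_zip = ~df["zip_code"].str.fullmatch(r"\d{5}", na=False)
print("invalid zip codes:", bad_zip.sum())
```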
Unbalanced Data
Fraudulent credit card transactions amount to about 4 out of 10,000 transactions, and in medical data, some conditions are very rare. One solution is to rebalance the data, for instance by over-sampling the fraudulent transactions in your training set. More and more, synthetic data is used to fill the gap. Augmented data, a blend of real observations and synthetic data, usually works best. View my talk on the subject, here.
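A minimal sketch of plain random over-sampling with scikit-learn follows; the fraud labels are synthetic and fixed at 4 per 10,000 for illustration. Synthetic-minority methods such as SMOTE (from the imbalanced-learn package) generate new minority examples instead of repeating existing ones.

```python
# Minimal sketch: rebalance a fraud-style training set by over-sampling
# the minority class with replacement.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 4))
y = np.zeros(10000, dtype=int)
y[:4] = 1   # ~4 fraudulent transactions per 10,000 (illustrative)

X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print("fraud rate before:", y.mean(), "after:", y_bal.mean())
```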
Unstructured Data
To get the most out of unstructured data (email messages, customer support conversations), become proficient in NLP techniques. There are ways to structure unstructured data, see here. Basic techniques that simply extract keyword lists and perform keyword matching are error-prone: some keywords should not be split ("San Francisco" is one token, not two). Text parsers that remove special characters or cannot handle foreign (accented) characters may create noise in your data, possibly resulting in misaligned columns if your data is stored in CSV files.
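One simple way to keep multi-word keywords intact is to protect them before tokenizing. Here is a minimal Python sketch; the phrase list and the example sentence are illustrative, and a production system would rely on a proper phrase model or named-entity recognizer.

```python
# Minimal sketch: protect multi-word keywords (e.g. "San Francisco") before
# naive whitespace tokenization, so they survive as single tokens.
import re

PHRASES = ["san francisco", "new york", "machine learning"]  # illustrative list

def tokenize(text: str) -> list[str]:
    text = text.lower()
    for phrase in PHRASES:
        # Replace the space inside the phrase with an underscore
        text = re.sub(re.escape(phrase), phrase.replace(" ", "_"), text)
    # Strip punctuation, keep accented characters intact (\w is Unicode-aware)
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

print(tokenize("Flights from San Francisco to New York, via machine learning."))
# ['flights', 'from', 'san_francisco', 'to', 'new_york', 'via', 'machine_learning']
```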
Outdated Data
The data you collect to create an economic index or measure a recession changes over time, and lookup tables in your dataset need regular updates. Tracking positive Covid tests is useless if most people stop testing or the disease is no longer a threat. The definition and measurement of a feature can also change over time. It is fine to combine old and new data, but you should include time stamps in your datasets and keep a log of all events that seriously impact your data. When designing a data collection procedure, discuss data updates and maintenance upfront.
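A minimal sketch of both habits, timestamping records on ingestion and appending notable events to a log, is shown below. The file names and the event text are hypothetical.

```python
# Minimal sketch: stamp ingested records with a collection timestamp and
# append notable events (definition changes, source switches) to a simple log.
import pandas as pd
from datetime import datetime, timezone

df = pd.read_csv("economic_indicators.csv")            # hypothetical file
df["ingested_at"] = datetime.now(timezone.utc).isoformat()  # when the data was collected
df.to_csv("economic_indicators_stamped.csv", index=False)

with open("data_event_log.txt", "a") as log:            # hypothetical event entry
    log.write(f"{datetime.now(timezone.utc).isoformat()}\t"
              "Changed definition of feature 'unemployment_rate'\n")
```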
About the Author
Vincent Granville is a machine learning scientist, author, and publisher. He was the co-founder of Data Science Central (now part of TechTarget) and most recently, the founder of MLtechniques.com.