Summary: It’s become almost part of our culture to believe that more data, particularly Big Data quantities of data, will result in better models and therefore better business value. The problem is that it’s just not always true. Here are 7 cases that make the point.
Following the literature and the technology, you would think there is universal agreement that more data means better models. With the explosion of in-memory analytics, Big Data quantities of data can now realistically be processed to produce a variety of different predictive models, and a lot of folks are shouting BIGGER IS BETTER.
However, every time a meme like this starts to pick up steam, it’s a good idea to step back and examine the premise. Is it universally true that our models will be more accurate if we use more data? As a data scientist you will want to question this assumption and not automatically reach for that brand-new high-performance in-memory modeling array before examining some of these issues.
Let’s start the conversation this way. In general I would always prefer to have more data than less, but there are some legitimate considerations and limitations to this maxim. I’d like to lift the kimono on several of them. Spoiler alert – there’s not one answer here. You have to use your best professional judgment and really think about it.
1. How Much Accuracy Does More Data Add?
Time and compute speeds are legitimate constraints on how much data we use to model. If you are fortunate enough to have both tons of data and one of the new in-memory analytic platforms then you don’t suffer this constraint. However, the great majority of us don’t yet have access to these platforms. Does this mean that our traditional methods of sampling data are doomed to create suboptimal models no matter how sophisticated our techniques?
The answer is clearly no, with some important qualifiers that we’ll discuss further on. If we go back to the 90s, when computational speed was much more severely restricted, you could find academic studies that supported between 15 and 30 observations for each feature (e.g., Pedhazur, 1997, p. 207). Today we might regard this as laughably small, but these studies showed that models built to this criterion could generalize well.
I’ve always held that better models mean better business value. Quite small increases in fitness scores can leverage up into much larger percentage increases in campaign ROI. But there are studies suggesting that for (relatively normal) consumer behavior scoring models, anything over 15,000 rows doesn’t return much additional accuracy.
If 15,000 observations seems trivial to you, then make it 50,000 and do your own experiments. The reductio ad absurdum version of this would be to insist on 10 million observations when you already have 1 million to work with. At some point any additional accuracy disappears into distant decimal points. Sampling large data sets, in many if not most cases, still works fine.
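If you want to run that experiment yourself, here is a minimal sketch using scikit-learn. The DataFrame df and the "response" target column are placeholders for your own data, not a prescription: train the same model on progressively larger samples and watch where the cross-validated AUC stops moving.

```python
# Minimal sketch: does accuracy keep improving as we add rows?
# `df` and the "response" column are placeholders for your own data.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def auc_by_sample_size(df, target="response", sizes=(5_000, 15_000, 50_000, 150_000)):
    scores = {}
    for n in sizes:
        if n > len(df):
            break
        sample = df.sample(n=n, random_state=42)
        X, y = sample.drop(columns=[target]), sample[target]
        # 5-fold cross-validated AUC at this sample size
        scores[n] = cross_val_score(GradientBoostingClassifier(), X, y,
                                    cv=5, scoring="roc_auc").mean()
    return scores

# Plot the results; in many consumer scoring problems the curve flattens early.
```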
2. There’s Lots of Missing Data
If you’ve got a dataset with lots of missing data, should you favor a much larger dataset? Not necessarily. The first question to answer is whether the features with the high missing-data rates are predictive. If they are, there are still many techniques for estimating the missing data that are perfectly effective and don’t warrant using multi-gigabyte datasets for modeling. Once again, the question is how far out in the decimal places any improvement will show up, and whether that warrants the additional compute time or an investment in in-memory analytic capability.
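To make that concrete, here is a minimal sketch, assuming scikit-learn and a numeric feature table X with missing values (a placeholder name): impute the gaps two different ways and compare the downstream model scores before concluding you need more rows.

```python
# Minimal sketch: estimate missing values rather than reaching for more rows.
# `X` is assumed to be a numeric pandas DataFrame or NumPy array with NaNs.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

def impute_two_ways(X):
    median_impute = SimpleImputer(strategy="median")   # fast baseline
    model_impute = IterativeImputer(random_state=0)    # regression-based estimates
    return median_impute.fit_transform(X), model_impute.fit_transform(X)

# Compare downstream model accuracy on each imputed version; the difference is
# often out in the distant decimal places rather than worth gigabytes more data.
```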
3. The Events I’m Modeling are Rare (Like Fraud)
Some events, like fraud, are so costly to an organization that a very large expenditure of energy to detect and prevent them can be justified. Also, because these events are rare, sometimes fractions of 1% of your dataset, the concern is that sampling might miss them, or that a less accurate model might produce so many false positives that it would overwhelm your ability to react.
If there is an area where an economic argument can be made for very large datasets it’s probably here in fraud. All the same, if you as a data scientist were faced with the problem of producing good models with limited data and limited time what would you do?
Probably the first thing you would do is recognize that fraud is not homogeneous; it consists of perhaps dozens of different types, each of which has a different signal in the data. This leads to the obvious: cluster, segment, and build multiple models. Yes, you will oversample the rare fraud events in your analytic dataset, but by segmenting the problem you can now build an accurate model for each fraud type. Having one extremely large dataset on an in-memory platform and treating this as a single modeling problem, no matter how sophisticated the algorithm, won’t be as accurate as this smaller-data approach.
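A minimal sketch of that segment-then-oversample approach, assuming scikit-learn and placeholder column names like "segment" and "is_fraud", might look like this:

```python
# Sketch: segment first, then oversample the rare fraud cases within each
# segment and fit one model per segment. Column names are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

def fit_per_segment(df, segment_col="segment", target="is_fraud"):
    models = {}
    for seg, part in df.groupby(segment_col):
        pos = part[part[target] == 1]
        neg = part[part[target] == 0]
        # oversample the rare positives to balance this segment's training set
        pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
        train = pd.concat([neg, pos_up])
        X = train.drop(columns=[segment_col, target])
        y = train[target]
        models[seg] = RandomForestClassifier(n_estimators=200).fit(X, y)
    return models
```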
4. What Looks Like an Outlier is Actually an Important Small Market Segment
An argument made by the in-memory folks is that when sampling you may observe an outlier that would ordinarily be discarded but, when seen through the lens of much more data, turns out to be a small but important market segment. A good example might be a financial product consumed mainly by middle income users but also of interest to very high income users. There are so few of the very high income users that they might be missed or treated as an unimportant outlier in sampling.
The fallacy in the more-is-better argument for important small segments is similar to the fraud case. The buying behavior of this important small market is unlikely to have the same drivers as the much larger middle market. In fact, using millions of rows of data on an in-memory platform encourages the user to try for a single model encompassing all the buyers. This misses the Modeling 101 lesson about first segmenting to get unique customer groups.
Large datasets might be warranted in the segmentation process, but that’s not particularly time or compute intensive. Better to segment out the different buyer types and model them separately, where sampled data will be perfectly adequate.
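Here is a hedged sketch of that workflow, assuming scikit-learn: use the larger dataset only to find the segments, then fit each segment’s model on a modest sample. The function, column names, and segment count are placeholders, not a prescription.

```python
# Sketch: cheap segmentation on the big data, modeling on samples per segment.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def segment_then_model(df, feature_cols, target, n_segments=4, sample_per_segment=15_000):
    km = KMeans(n_clusters=n_segments, random_state=0)
    df = df.copy()
    df["segment"] = km.fit_predict(df[feature_cols])
    models = {}
    for seg, part in df.groupby("segment"):
        # a modest sample per segment is usually plenty for the scoring model
        sample = part.sample(n=min(sample_per_segment, len(part)), random_state=0)
        models[seg] = LogisticRegression(max_iter=1000).fit(
            sample[feature_cols], sample[target]
        )
    return km, models
```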
5. Data Is Not Always Cheap
If you’re getting millions of rows of observational data from your ecommerce site or floods of data from IoT sensors, you may have forgotten that in many environments the cost of the data can be prohibitive and needs to be factored into your modeling strategy. There are plenty of examples where more complex consumer behaviors require expensive or time-consuming surveys or focus groups. Here’s an even better example.
In 2010 the government wanted to determine the most cost-effective method of detecting and removing unexploded ordnance (UxO). Basically these are unexploded bombs or artillery shells in a war zone that have buried themselves fairly deep in the ground and need to be dug up by hand, which is expensive, about $125 per hole. The problem was that there were millions of random pieces of metal embedded in the battlefield that were not UxO.
Using a towed array of electromagnetic sensors you could cover a fairly large piece of ground at relatively low cost, yielding decent data. If the data from the towed array wasn’t definitive, you had to park the array in a static position over the target to get better data, which was much more expensive, but still less than digging the hole. This was very expensive data, with even more costly outcomes for false positives.
Using random forests and linear genetic programs, the study was able to successfully detect actual UxO with just the least expensive towed-array data, clearing the dangerous objects at the minimum cost.
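The study’s actual models aren’t reproduced here, but the cost logic is easy to sketch: given predicted probabilities that a buried object is UxO, pick the decision threshold that minimizes expected dollars. The $125 dig cost comes from the example above; the penalty for leaving a real UxO in the ground is an assumed, illustrative number.

```python
# Simplified sketch of the cost trade-off, not the study's actual method.
import numpy as np

def best_threshold(y_true, y_prob, dig_cost=125.0, miss_cost=100_000.0):
    # miss_cost is illustrative only; set it from your own risk assessment
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in thresholds:
        dig = y_prob >= t                    # objects we choose to dig up
        missed = (~dig) & (y_true == 1)      # real UxO left in the ground
        costs.append(dig.sum() * dig_cost + missed.sum() * miss_cost)
    return thresholds[int(np.argmin(costs))]
```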
6. Sometimes We Get Hypnotized By the Overwhelming Volume of the Data and Forget About Data Provenance and Good Project Design
A few months back I reviewed an article by Larry Greenemeier about the failure of Google Flu Trends analysis to predict the timing and severity of flu outbreaks based on social media scraping. It was widely believed that this Big Data volume of data would accurately predict the incidence of flu, but the study failed miserably, missing timing and severity by a wide margin.
Says Greenemeier, “Big data hubris is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. The mistake of many big data projects, the researchers note, is that they are not based on technology designed to produce valid and reliable data amenable for scientific analysis. The data comes from sources such as smartphones, search results and social networks rather than carefully vetted participants and scientific instruments.”
7. Sometimes Small Data Beats Big Data for Accuracy
Finally, there is the somewhat surprising case of smaller datasets beating larger datasets for accuracy. I first reported on this when trying to figure out if Real Time Predictive Models Are Actually Possible. This is not about whether we can score new data with existing models in real time. This is about whether we can create new models de novo in real time, in this case from streaming data.
I was skeptical, but to my surprise I found one instance where greater accuracy was achieved with very small amounts of real-time streaming data compared to the same model run in batch with much larger volumes of data. Even more surprising, the most accurate of the real-time models was consistently created with only 100 observations, and when more streaming data was used the results deteriorated.
The example comes from UK-based Mentat Innovations (ment.at) which in December 2015 published these results regarding their proprietary predictive modeling package called “Streaming Random Forests”. To quote directly from their findings:
“When the dataset is ordered by timestamp, the best performing window size is 100, on the lower end of the scale. This is a classic case of “more data does not equal more information”: using 100 times more data (w=10,000 vs w=100) almost doubles (175%) the error rate!!”
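Their package is proprietary, so the sketch below is only a generic stand-in for that kind of experiment: train on a sliding window of the most recent w observations, score the next batch, and compare error rates across window sizes on time-ordered data.

```python
# Generic sketch (not Mentat's "Streaming Random Forests"): sliding-window
# training on time-ordered data, scored on the next batch of arrivals.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def windowed_error(X, y, window=100, step=100):
    errors = []
    for start in range(window, len(X) - step, step):
        train_X, train_y = X[start - window:start], y[start - window:start]
        test_X, test_y = X[start:start + step], y[start:start + step]
        model = RandomForestClassifier(n_estimators=50).fit(train_X, train_y)
        errors.append(1.0 - model.score(test_X, test_y))
    return np.mean(errors)

# Compare, e.g., windowed_error(X, y, window=100) against window=10_000 on
# time-ordered data to run the kind of comparison quoted above.
```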
In general would I still prefer to have more data than less? Yes, of course. But I’m not giving in to the kitchen sink approach of using all that data just because it’s there. Unless you have one of the in-memory analytic platforms there is still computational time and human data science time to be considered, and sometimes, but not always, less is simply better.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: