Summary: If you’re responsible for a Data Science team of more than three or four people, it’s time to start thinking about productivity and efficiency.
Efficiency is not something we often think about in managing our data science teams, but increasingly we should. Suppose you are a data scientist who is now asked to lead your group, or, even more difficult, suppose you are not a data scientist and have the DS group reporting to you. How do you ensure you’re getting the appropriate return on your investment?
Especially if you are a non-data-scientist executive with overall responsibility for a DS group, even asking the right questions may seem daunting. After all, it took a lot of effort to get funding, then to find those rare hires, and finally to get them up and running. And they speak that arcane DS dialect that not even the techies in IT can understand. It looks like they’re doing OK. They’re producing some useful models and bringing in new business insights. But could they be doing better? Here are three tips for getting the most out of your group.
1. Common Platform
First, there’s no longer a place for the lone-wolf data scientist in an advanced analytics shop. That almost assuredly means you need to drive your team toward a common advanced analytics platform. It could be R or Python, or it could be SAS, SPSS, or one of the other proprietary platforms, but you can’t have everybody doing their own thing.
When predictive models are built on different platforms or in different languages the overall intent may be the same, but unless everybody is speaking the same language, communication suffers: fewer minds can share a problem, and there is less supervision and collaboration. Both of these are worrisome.
By the way, a common platform doesn’t mean just the one that runs the data science algorithms. Importantly, since most of the time in any new project is spent blending, cleansing, and preparing the data, your common platform should be capable in this area as well.
Here’s my personal bias. Having everyone write original R or Python code may look cool, but it’s not reliably repeatable by the same data scientist or between data scientists. Platforms that incorporate drag-and-drop interfaces (aka visual IDEs) are typically designed with repeatability in mind, and to my way of thinking that’s a big advantage.
2. Common Methodology
Second, there should be an agreed-upon process or methodology. All data scientists are raised on certain principles, and these are most commonly embodied in the CRISP-DM methodology (Cross Industry Standard Process for Data Mining). I had the pleasure of helping to develop it back in the 90s, and there’s nothing magic here, just good common sense. But unless you have an agreed methodology and enforce it, you won’t know who is cutting corners and with what consequences.
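If you want the methodology to be enforceable rather than aspirational, something as simple as a project checklist can do the job. Here’s a minimal sketch in Python; the six phase names are the standard CRISP-DM phases, but the project-log format and the function around them are purely illustrative assumptions.

```python
# Illustrative only: a lightweight way to make an agreed methodology auditable.
# The six phase names are standard CRISP-DM; the project-log format is hypothetical.

CRISP_DM_PHASES = [
    "business_understanding",
    "data_understanding",
    "data_preparation",
    "modeling",
    "evaluation",
    "deployment",
]

def missing_phases(project_log: dict) -> list:
    """Return any CRISP-DM phases a project has not documented."""
    return [p for p in CRISP_DM_PHASES if not project_log.get(p)]

# Example: a project that skipped straight from prep to deployment
log = {"business_understanding": "notes.md", "data_preparation": "prep.ipynb",
       "modeling": "model.ipynb", "deployment": "scoring_service"}
print(missing_phases(log))   # ['data_understanding', 'evaluation']
```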
There is, however, one more area you need to look at that may not be as obvious as these.
3. How Much Accuracy Do You Need?
A primary reason individuals become data scientists is a love of problem solving and finding the hidden signal in the data. It’s a very creative process that starts with data exploration and moves through data cleansing and transformation, feature engineering and selection, and finally model creation and deployment.
In the 90s, when the tools were less helpful, it wasn’t unusual for a single model to require anywhere from a week to a month to develop. That’s a 4X spread in model development time. Although today’s tools are easier and faster to use, I would bet that the spread in development time is still easily 4X. Clearly you would like to be on the short end of that spread and get more good models in less time.
So first, start keeping some metrics and figure out how long it is taking your data scientists to produce models. Yes, some problems are tougher than others, but you can define categories of comparable problems and benchmark within them.
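To make that concrete, here is a minimal sketch of this kind of metric-keeping in Python. The file name, column names, and categories are assumptions rather than any standard; the point is simply to compare development time within comparable categories.

```python
# A minimal sketch of the metric-keeping described above.
# The CSV layout and column names are assumptions, not a standard.
import pandas as pd

# projects.csv: one row per model, e.g.
# project,category,start_date,deploy_date
# churn_model,propensity,2016-01-04,2016-01-22
df = pd.read_csv("projects.csv", parse_dates=["start_date", "deploy_date"])
df["dev_days"] = (df["deploy_date"] - df["start_date"]).dt.days

# Compare like with like: summarize development time within each problem category
summary = df.groupby("category")["dev_days"].agg(["count", "median", "min", "max"])
print(summary)
```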
There are a handful of industries, like digital advertising, where the business environment puts a natural lid on this. In bidding on page views from one of the ad exchanges, for example, the whole process must occur in about 100 ms, and the step of determining bid/no-bid and the price to be paid is restricted to about 10 ms. That limits the types of algorithms that can be used and creates natural limits for good-enough accuracy. Most use cases, however, don’t come with these built-in limits.
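For illustration, here is one hedged sketch of how such a budget might be checked offline before a model goes into the bidding path. The 10 ms figure comes from the example above; the model object, feature vector, and function name are assumptions.

```python
# Illustrative only: check offline whether a candidate model can meet a hard
# latency budget like the ~10 ms bid decision described above.
import time

BUDGET_MS = 10.0  # assumed per-decision budget

def meets_latency_budget(model, sample_features, trials=1000):
    """Time single-row scoring repeatedly and compare the worst case to the budget."""
    worst = 0.0
    for _ in range(trials):
        start = time.perf_counter()
        model.predict(sample_features)
        worst = max(worst, (time.perf_counter() - start) * 1000.0)
    return worst <= BUDGET_MS
```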
What’s Taking So Long? The Desire to Win.
It’s likely that the most common cause of long development times is the desire to win! Winning, to a data scientist, means building the most accurate model, measured by whatever evaluation metric is appropriate: R^2, AUC, P&L, or others. That’s in our blood. Those are our bragging rights. We’ll rework those steps over and over to make the model a little better and won’t be happy until we’ve eked out the last little bit of accuracy.
The question is: at what point does the cost of that extra accuracy stop paying for itself in business value? In many modeling situations, especially those that involve buying behavior, the breakeven is fairly straightforward to calculate.
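As a back-of-the-envelope illustration, here is one way to frame that breakeven in Python. Every number below is hypothetical; plug in your own campaign size, conversion rates, and loaded labor cost.

```python
# Back-of-the-envelope breakeven sketch; every number here is hypothetical.
# The question: does another week of model tuning pay for itself?

campaign_size  = 500_000    # prospects contacted per campaign
value_per_conv = 40.0       # profit per incremental conversion ($)
baseline_rate  = 0.020      # conversion rate with the current model
improved_rate  = 0.021      # conversion rate expected from the tuned model
campaigns_year = 4          # times the model will be reused this year

extra_ds_weeks = 1          # additional tuning time
ds_cost_week   = 5_000.0    # fully loaded cost of a data scientist week ($)

incremental_profit = (improved_rate - baseline_rate) * campaign_size \
                     * value_per_conv * campaigns_year
tuning_cost = extra_ds_weeks * ds_cost_week

print(f"Incremental profit: ${incremental_profit:,.0f}")   # $80,000
print(f"Tuning cost:        ${tuning_cost:,.0f}")          # $5,000
print("Worth it" if incremental_profit > tuning_cost else "Stop tuning")
```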
As a manager of data scientists you’ll need to be a little careful here. A very small increase in model fitness can leverage into a much greater increase in campaign ROI. In one example, a 0.012-point increase in AUC made an 8% difference in profitability.
But this isn’t a Kaggle competition, where winning comes down to a change in the fourth decimal place. You’ll need to dig in a little here and determine whether the time is going into excess data prep, excess time integrating external data, or excess time in modeling. Once you understand where the time goes, you’ll be able to set some guidelines and expectations.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: