Data analytics is a mature discipline at this point, and even those outside the data science world generally understand what it’s all about. Modern data science, however, is still new enough to spur questions. Vincent Granville, Executive Data Scientist at Data Science Central, spoke with Roy Wilds, Chief Data Scientist at PHEMI, a Vancouver-based big data startup, about the best way to educate people on the transformative power of big data paired with data science.
VG – When you’re working with customers who aren’t necessarily data experts, how do you explain the difference between data analytics and the kind of work they can do with big data and data science?
RW – As you can imagine, this is a conversation that we have a lot. We work with many organizations that are already using data analytics to report on trends, identify outliers, track key performance indicators, power executive dashboards, that sort of thing. So how do you explain why the combination of big data plus data science is something else altogether? We focus on three big differences.
First, it’s about the kinds of questions you want to ask. If you want to understand summary statistics of your data, then you look to traditional SQL-based data analytics. But if you want to perform other kinds of analytics, such as natural language processing, predictive modeling, or dimensionality reduction, basic data analytics can’t accommodate that. Data science allows you to ask different kinds of questions and look at your data in very different ways.
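To make that contrast concrete, here is a minimal sketch, assuming PySpark and invented toy data, that puts a summary-statistics query next to a dimensionality-reduction step, the kind of question traditional SQL analytics isn’t built to answer:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("analytics-vs-data-science").getOrCreate()

# Traditional SQL-style analytics: summary statistics over a small toy table
sales = spark.createDataFrame(
    [("east", 120.0), ("west", 95.5), ("east", 130.2), ("west", 101.0)],
    ["region", "amount"],
)
sales.groupBy("region").avg("amount").show()

# A data science question on the same data platform: dimensionality reduction with PCA
points = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.5, 3.0]),),
     (Vectors.dense([2.0, 1.5, 0.0]),),
     (Vectors.dense([4.0, 2.5, 1.0]),)],
    ["features"],
)
pca_model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(points)
pca_model.transform(points).select("pca_features").show(truncate=False)
```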
Now, when you combine that with big data, you can also begin analyzing different kinds of data than you could before. Can your data be easily represented in relational data models, linking columns and tables in a traditional SQL database? Or are you trying to identify patterns in a big, messy collection of all sorts of data types? The more diverse your data sources and formats, and the more unstructured data you want to analyze, the harder it’s going to be to fit everything into a traditional tabular format.
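As a small illustration of that point, here is a sketch, again assuming PySpark, of loading semi-structured records whose fields vary from one record to the next; the records themselves are invented, but this is the kind of content that resists a fixed relational schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixed-data").getOrCreate()

# Hypothetical semi-structured records: each one carries different fields,
# which is awkward to force into a single rigid table up front.
records = [
    '{"patient_id": 1, "note": "complains of chest pain", "vitals": {"hr": 88}}',
    '{"patient_id": 2, "labs": [{"test": "A1C", "value": 6.1}]}',
]
df = spark.read.json(spark.sparkContext.parallelize(records))

df.printSchema()             # schema is inferred: nested and sparse, not tabular
df.show(truncate=False)      # missing fields simply come back as nulls
```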
Finally, it’s about scale. Are you talking about a month’s worth of data? Or are you going back through years of longitudinal public health studies, public opinion surveys, medical records data? Are you trying to sift through millions of EKG results to find patients with a certain condition? For the former, conventional SQL-based data storage and analysis tools are fine. For the latter, you’re going to want to harness the power of big data and a distributed programming framework like Apache Spark.
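A minimal sketch of that large-scale pattern, assuming PySpark and an illustrative Parquet store of EKG summary records, might look like the following; the paths, column names, and thresholds are all placeholders, not details from any real deployment:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ekg-screen").getOrCreate()

# Hypothetical dataset: millions of EKG summary records stored as Parquet,
# spread across the cluster. Path and columns are illustrative only.
ekg = spark.read.parquet("hdfs:///data/ekg_results/")

# Flag patients with repeated readings above an illustrative QT-interval threshold.
flagged = (
    ekg.filter(F.col("qt_interval_ms") > 450)
       .groupBy("patient_id")
       .agg(F.count("*").alias("abnormal_readings"))
       .filter(F.col("abnormal_readings") >= 3)
)
flagged.write.parquet("hdfs:///data/ekg_flags/", mode="overwrite")
```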
Ultimately, I’d put it this way: traditional analytics can be very effective when you have well-defined, specific questions, which tend to produce relatively simple answers. If you want to ask open-ended questions, when you’re not even sure yet what you’re going to find in your data, then you want the flexibility of a data science approach, using something like Spark in a big data setting, to continually ask new questions and examine vast amounts of data in new ways.
VG – Is there a common pain point you find where organizations really open up to the possibilities of more advanced data science?
RW – Yes, when organizations are ready to move from basic descriptive analytics to more predictive and prescriptive modeling. We see this especially in more forward-looking organizations, where people have gotten a taste of what they can do with their data, and now they want to really capitalize on it to gain a competitive edge.
These organizations are already nearing the limit of what they can learn from basic descriptive reporting. Now, they want to find out what they should be doing differently, or identify a new need or opportunity before anyone else does. These are the cases where you can start to use machine learning, for example Spark with MLlib in a big data setting, to move into more open-ended discovery and predictive analysis.
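A sketch of what that step toward predictive modeling can look like with Spark MLlib follows; it assumes a hypothetical historical dataset with a binary outcome, and the path, feature columns, and "churned" label are illustrative rather than drawn from any real project:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("predictive-sketch").getOrCreate()

# Hypothetical historical data with a binary outcome column named "churned".
history = spark.read.parquet("hdfs:///data/history/")

assembler = VectorAssembler(
    inputCols=["age", "prior_visits", "avg_monthly_spend"],  # illustrative features
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

train, test = history.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)

# Score the holdout set; downstream teams would act on these probabilities.
predictions = model.transform(test).select("customer_id", "probability", "prediction")
predictions.show(5, truncate=False)
```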
VG – Do you have specific examples you use when discussing this with customers to illustrate these differences in concrete terms?
RW – Of course. I point to work the City of Houston has done, where they had years of data on hurricanes and storm surges in the region, as well as long-term economic and population studies. They were able to create a model linking storm damage to socioeconomic data. They’re predicting the populations that are likely to be hardest hit by flooding from tropical storms, and using that as a basis for policy.
The City of Chicago is also doing very interesting work to give first responders a smarter, more comprehensive operational view of the city. They’re pulling together practically every data source you can imagine—911 calls, video surveillance feeds, city bus location data, social media from citizens, geospatial tags from 311 reports. They’re running analysis on this data and building these incredibly in-depth maps of what’s happening in the city at any given moment.
I think both of these cases illustrate the power of using big data and modern data science to aggregate massive amounts of information and ask entirely new kinds of questions. When you make the move to big data and distributed computing, you have the scale and flexibility to open up whole new worlds of possibilities.