Many of the stages in a typical data science lifecycle have more to do with data than they do with science. Before the data scientist can actually engage in science, there are several steps they must first complete:
- Find where the right data is located.
- Access that data, which requires navigating the organization's bureaucracy around ownership, credentials, access methods, and access technologies.
- Transform the data into a format that is easy to use.
- Combine that data with data from other sources, which may be formatted differently.
- Profile and cleanse the data to eliminate incomplete or inconsistent data points.
Complicating matters is the fact that 87 percent of data science projects never make it into production. A key reason behind this high failure rate is that the variety of data sources, the diversity of data types, and the sheer volume of data make it complex for data scientists to access the right data at the right time.
Eliminating the Bottlenecks
While a number of technologies are competing to bridge the elusive gap between data and the data scientist, one modern data integration and data management technology is solving this issue with a novel approach. Rather than physically moving data so that it can be discovered, accessed, and leveraged by the data scientist, data virtualization (DV) provides data scientists with a real-time view of the data in its existing locations.
Architecturally, data virtualization occupies a layer between the different data sources and the consuming applications. The DV layer itself contains no data; it only contains the metadata necessary to access the different data sources. And while the technology does not eliminate data preparation activities, it greatly accelerates them, effectively eliminating the key bottlenecks in the data science lifecycle.
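To make that architecture concrete, here is a minimal sketch, assuming the DV server exposes a standard ODBC endpoint; the DSN, credentials, and view name are hypothetical. The point is that the client sees a single SQL interface, while the DV layer, holding only metadata, resolves the virtual view against the underlying sources at query time:

```python
import pyodbc

# One connection point, regardless of where the underlying data lives.
conn = pyodbc.connect("DSN=dv_server;UID=analyst;PWD=secret")  # hypothetical DSN

# "customer_360" is a hypothetical virtual view; the DV layer stores only
# the metadata needed to resolve it against the real sources at query time.
cursor = conn.cursor()
cursor.execute("SELECT customer_id, region, lifetime_value FROM customer_360")
for row in cursor.fetchmany(10):
    print(row)
conn.close()
```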
It’s important to recognize some of the ways that data virtualization can remove the logjam in a typical data science workflow and how it can be used to overcome the key challenges of the data science lifecycle:
- Identifying Useful Data: DV provides data scientists with a single, unified SQL interface for accessing all data, whether it lives in physical data lakes, Spark or Presto implementations, APIs delivering Salesforce or social media data, or flat and JSON files (see the first sketch following this list). Some DV solutions also provide data catalog capabilities, which enable data scientists to discover data with search-engine-like features and to recommend or rate different data sets.
- Modifying Data into a Useful Format: Data virtualization facilitates the combination of data from different sources using SQL joins, aggregations, and transformations. Some data virtualization solutions also provide administrative tools that offer drag-and-drop simplicity. Data scientists can leverage their own notebooks, such as Jupyter, for such operations, or use the notebooks included in some DV offerings (see the notebook sketch following this list). In either case, these notebooks provide highly flexible, visual interfaces and intuitive features like automatically generated charts.
- Analyzing Data: With data virtualization, analysis can begin almost immediately at the point of access; in identifying useful data or modifying it into different formats, the data scientist is already executing queries.
- Preparing and Executing Data Science Algorithms: Advanced DV solutions provide query optimizers, which streamline query performance using techniques such as maximizing the push-down of processing to the sources. The optimizer might push down only part of an operation, depending on the expected results. DV can also accelerate model scoring and provide frameworks, in languages such as Python, for automatically publishing models as REST APIs (see the final sketch following this list).
- Sharing Results with Business Users: By leveraging a data catalog as part of a data virtualization implementation, data scientists can share their queries with other team members for a more collaborative, iterative workflow. Data scientists can execute filters or aggregations and share them with others to see if they are on the right track, and at any time in the workflow they can ask for feedback on in-process queries. Once the models are in place and the results are ready, data virtualization provides different ways of sharing that information with business users. The DV solution might use its native driver to deliver the data directly to a specific application like Tableau, MicroStrategy, or Power BI. Users of those tools would connect to the data virtualization server and see the results directly in their chosen tool.
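To illustrate the unified SQL interface from the first bullet, the following sketch runs one federated query across two hypothetical virtual views: sf_accounts, backed by a Salesforce API connector, and lake_orders, backed by a data lake. Views over Presto tables or JSON files would appear the same way, as just more tables in the same dialect:

```python
import pyodbc

# Each table below is a hypothetical virtual view over a different physical
# source; the DV optimizer decides which parts to push down to each source.
FEDERATED_QUERY = """
SELECT a.account_name,
       SUM(o.amount) AS total_orders
FROM sf_accounts a                               -- Salesforce, via API
JOIN lake_orders o ON o.account_id = a.account_id -- data lake
GROUP BY a.account_name
ORDER BY total_orders DESC
"""

conn = pyodbc.connect("DSN=dv_server;UID=analyst;PWD=secret")  # hypothetical
for account_name, total_orders in conn.cursor().execute(FEDERATED_QUERY):
    print(account_name, total_orders)
```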
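As a sketch of the notebook workflow from the second bullet, the snippet below pulls an aggregated result from the DV layer into pandas and charts it, assuming the DV server exposes a PostgreSQL-compatible endpoint (common among DV products); the host, port, database, and view name are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical PostgreSQL-compatible endpoint exposed by the DV server.
engine = create_engine("postgresql://analyst:secret@dv-server:5432/virtual_db")

# The aggregation executes in the DV layer (and, where possible, is pushed
# down to the sources); the notebook only receives the small result set.
df = pd.read_sql(
    "SELECT region, AVG(lifetime_value) AS avg_ltv "
    "FROM customer_360 GROUP BY region",
    engine,
)

# A chart is then one line away in a Jupyter notebook.
df.plot.bar(x="region", y="avg_ltv", title="Average lifetime value by region")
```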
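Finally, as a sketch of publishing a model as a REST API, here is roughly the kind of Python service such a framework might generate or a data scientist might write by hand; the Flask app, model file, and route are hypothetical stand-ins, not any vendor's actual output:

```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("churn_model.pkl")  # hypothetical pre-trained model

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()             # expects {"features": [[...]]}
    predictions = model.predict(payload["features"])
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(port=8080)
```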
Data Virtualization and the Data Science Lifecycle
Data science can be streamlined by eliminating a number of key bottlenecks, all of which have to do with data. Fortunately, data virtualization has proven that it can remove all of them. The technology can be strategically deployed at all of the critical phases of the data science lifecycle, accelerating data science initiatives with real-time access to disparate sources of data and enabling the business to rest easier knowing that decisions are being made using complete and proven data.
About the Author: Paul Moxon is the VP Data Architectures and Chief Evangelist at Denodo, a leading provider of data virtualization software. Paul has over 30 years of experience with enterprise middleware technologies with leading software companies, such as BEA Systems and Progress Software. For more information visit https://www.denodo.com or https://twitter.com/denodo.