For years, organizations have been working steadily to consolidate their data into a single place via a data warehouse or, more recently, a data lake. Data lakes offer key advantages over data warehouses, data marts, and traditional databases that require data to be structured and organized in particular ways. However, businesses have found that they are spending more to build and manage their data lakes while receiving less value from them. As a result, organizations are finding that centralized data infrastructures and approaches present unintended consequences, including:
- Knowledge Discrepancy: Centralized data teams are ill-equipped to understand the data as well as the individual business teams that own specialized parts of the total dataset.
- Rigid Infrastructure: Centralized data architectures will never be flexible enough to accommodate the needs of the different departments within an organization.
- Delayed Time-to-Value: Centralizing data from multiple sources takes a significant amount of time, which prevents data consumers from accessing data on demand.
To surmount these problems, organizations are looking closely at a new decentralized architectural approach to data infrastructure called a “data mesh.” According to Deloitte, “The data mesh concept is a democratized approach of managing data where different business domains operationalize their own data, backed by a central and self-service data infrastructure. The infrastructure comprises of data pipeline engines, storage, and computing capabilities that are bundled. Rather than looking at enterprise data as one huge data repository, data mesh considers it as a set of repositories of data products. Hence, a business domain (e.g. “Finance”) provides data as a product; ready to use for analysis purposes, discoverable, and reliable. This way, the data product owner is the actual business domain representative that has the deep domain knowledge.”
In a data mesh configuration, different departments or groups within an organization would own individual data domains, enabled by a central self-serve data platform and governed by a set of overarching standards to ensure interoperability. Each data domain would deliver its data products in ways that are purposefully designed for ease of consumption by their intended audiences and that conform to the global standards of the organization. Ownership is decentralized, while provisioning and governance remain partly centralized. The data mesh architecture promises to overcome the limitations of fully centralized infrastructures. However, many organizations still question how to implement this delicate balance between independent domains that are nonetheless supported by a central data platform, and they are looking to existing technologies such as data virtualization (DV) to address this issue.
Enabling Replication-Free Data Access
While a wide range of solutions is becoming available to help, data virtualization is emerging as a data integration technology that is key to implementing a data mesh. Unlike extract, transform, and load (ETL) processes and other batch-oriented data integration approaches, data virtualization enables access to data without having to first replicate the data to a centralized repository. In this way, data virtualization can be thought of as an inherently “decentralized” data integration strategy, as it establishes an enterprise-wide layer above an organization’s diverse data sources. To query across the sources, data consumers simply query the data virtualization layer which, in turn, retrieves the necessary data, abstracting consumers from the complexities of access. The DV layer contains no actual data; however, it stores all of the necessary metadata for accessing the various sources.
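To make this concrete, the sketch below shows how a data consumer might issue a single SQL query against the virtualization layer and let it resolve which underlying systems to reach. This is a minimal sketch under assumptions: the ODBC data source name (dv_layer) and the view names are hypothetical placeholders, not any particular vendor's syntax.

```python
# Minimal sketch: a consumer issues one SQL query against the virtualization
# layer, which federates the work across the underlying sources.
# The DSN "dv_layer" and the view names are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect("DSN=dv_layer")  # ODBC data source pointing at the DV layer
cursor = conn.cursor()

# The consumer sees a single logical view; the DV layer decides which
# physical sources (warehouse tables, SaaS APIs, lake files) to query.
cursor.execute("""
    SELECT c.customer_id, c.region, SUM(o.amount) AS total_spend
    FROM customer_master c        -- e.g., backed by a CRM system
    JOIN order_history o          -- e.g., backed by a data warehouse
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.region
""")

for row in cursor.fetchall():
    print(row.customer_id, row.region, row.total_spend)
```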
By providing a single place to store metadata, data virtualization enables organizations to implement automatic role-based security and data governance protocols across the organization from a single point of control. For example, organizations can automatically mask salary data unless the user has the requisite credentials to view this information. The data virtualization layer provides most of the necessary self-serve data platform functionality that is required in a data mesh architecture.
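As a rough illustration of such a policy, the following sketch applies a salary-masking rule in one place before results are returned to a consumer. The role names and record fields are assumptions made for the example, not a specific product's security API.

```python
# Conceptual sketch of the salary-masking rule described above, applied once
# at the virtualization layer rather than in every consuming application.
# Role names and fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class UserContext:
    user_id: str
    roles: set

def apply_salary_policy(record: dict, user: UserContext) -> dict:
    """Return the record with salary masked unless the caller is authorized."""
    if "hr_compensation" in user.roles:
        return record
    masked = dict(record)
    masked["salary"] = "***"
    return masked

analyst = UserContext(user_id="jdoe", roles={"analyst"})
row = {"employee_id": 1042, "department": "Finance", "salary": 98000}
print(apply_salary_policy(row, analyst))  # salary is masked for this role
```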
Above the DV layer, organizations can implement multiple semantic layers, structured by different departments and functioning as semi-autonomous data domains. Each of these can be flexibly adjusted or removed without affecting the underlying data. As a result, organizations can easily establish standard data definitions that can be reused across different domains and ensure semantic interoperability among the different data products, thus facilitating federated governance.
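For instance, a shared definition might be declared once and reused by several domain-level semantic models, as in the illustrative sketch below; the definition names are assumptions for the example, not any particular modeling syntax.

```python
# Sketch of a canonical definition reused by two domain-level semantic layers,
# so both expose "net_revenue" the same way. Names are illustrative assumptions.
CANONICAL_DEFINITIONS = {
    "net_revenue": "gross_amount - discounts - returns",
    "customer_id": "UUID issued by the master data service",
}

finance_semantic_layer = {
    "net_revenue": CANONICAL_DEFINITIONS["net_revenue"],  # reused, not redefined
    "fiscal_quarter": "derived from posting_date",
}

sales_semantic_layer = {
    "net_revenue": CANONICAL_DEFINITIONS["net_revenue"],  # same definition
    "pipeline_stage": "CRM opportunity stage",
}
```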
Creating Data Products
As organizations look to data mesh to develop data products, they are leveraging the DV layer to create virtual models without the need for stakeholders to understand the complexities of the sources that feed them. In this way, they can make these virtual models accessible as data products, via a flexible array of methods such as SQL, REST, OData, GraphQL, or MDX, without the need to write code.
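The sketch below shows what consuming such a data product over REST might look like from the consumer's side. The base URL, path, and parameter names are hypothetical and would depend on how the platform publishes the virtual model.

```python
# Sketch of consuming a published data product over REST. The URL, path,
# and parameters are hypothetical placeholders for this illustration.
import requests

BASE_URL = "https://dv.example.com/server/finance"  # hypothetical DV endpoint

resp = requests.get(
    f"{BASE_URL}/views/quarterly_revenue",
    params={"fiscal_year": 2023, "format": "json"},
    headers={"Authorization": "Bearer <token>"},     # e.g., an SSO-issued token
    timeout=30,
)
resp.raise_for_status()

for record in resp.json().get("elements", []):
    print(record)
```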
Additionally, data products are endowed with support for features such as data lineage tracking, self-documentation, change impact analysis, identity management, and single sign-on (SSO). By centrally storing metadata, the data virtualization layer provides all the necessary ingredients for full-featured, comprehensive data product catalogs that clearly articulate an organization’s data assets, organized by domain.
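One way to picture a catalog entry assembled from metadata the DV layer already holds is sketched below; the field names are illustrative assumptions rather than a standard catalog schema.

```python
# Illustrative shape of a data product catalog entry built from DV metadata.
# Field names are assumptions for the sketch, not a standard catalog schema.
from dataclasses import dataclass, field

@dataclass
class DataProductEntry:
    name: str
    domain: str
    owner: str
    description: str
    upstream_sources: list = field(default_factory=list)  # lineage
    access_methods: list = field(default_factory=list)    # SQL, REST, OData, ...

catalog = [
    DataProductEntry(
        name="quarterly_revenue",
        domain="Finance",
        owner="finance-data-team@example.com",
        description="Revenue by product line, reconciled quarterly.",
        upstream_sources=["erp.gl_postings", "crm.opportunities"],
        access_methods=["SQL", "REST"],
    ),
]
```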
Establishing Data Domain Autonomy
Because data virtualization enables organizations to build views and semantic models above the source data without affecting the underlying data, it provides a ready foundation for the autonomy of data domains. The architecture enables data domain stakeholders to select the data sources that feed their products and to change the mix as needed to suit their requirements. Business units operating their own data marts and favored SaaS applications can repurpose that information in a data mesh configuration with very little effort, as data domains can be scaled independently.
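As a simple illustration, a domain team might declare the mix of sources behind one of its products in a mapping like the one below and change it later without touching other domains; the source identifiers are assumptions made for the sketch.

```python
# Sketch of a domain-owned mapping from a data product to the sources that
# feed it. Source identifiers are illustrative assumptions.
finance_domain = {
    "data_products": {
        "quarterly_revenue": {
            "sources": ["warehouse.finance.gl_postings", "saas.billing.invoices"],
            "exposed_as": ["SQL", "REST"],
        }
    }
}

# Swapping a source only changes this domain's mapping; consumers keep
# querying the same virtual view through the DV layer.
finance_domain["data_products"]["quarterly_revenue"]["sources"] = [
    "lakehouse.finance.gl_postings",  # migrated source
    "saas.billing.invoices",
]
```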
It is important to note that data virtualization does not replace monolithic repositories like data warehouses and data lakes; instead, data virtualization treats such repositories just like any other source. In the case of a data mesh configuration, they become nodes in the mesh. This means that data domains with strong ties to existing data warehouses or data lakes can continue to go that route for certain data products, such as those that require machine learning. In this scenario, data products would continue to be accessed through the virtual layer and be governed by the same protocols that oversee the rest of the data mesh.
Data mesh is a promising new architecture for avoiding many of the pitfalls of highly centralized data infrastructures. But organizations need the right technology underpinning in order to leverage data mesh effectively and in a straightforward manner, without necessitating legacy hardware replacement.
About the Author: Alberto Pan is Chief Technical Officer at Denodo, a leading provider of data virtualization software. He is also an Associate Professor at the University of A Coruña. Alberto has authored more than 25 scientific papers in areas such as data virtualization, data integration, and web automation. For more information, visit www.denodo.com or follow them @denodo