As stated in the previous article, database obesity caused by numerous intermediate tables and stored procedures is rooted in the database's closed computational system. If an independent computing engine could provide computing power outside the database, the database could lose weight.
With a separate computing engine, database-generated intermediate data no longer has to be stored as data tables; it can be kept in the file system and further computed by the engine. Since intermediate data is read-only, storing it as files means it never needs to be rewritten, so it stays compact and compresses well, and no transaction-consistency overhead is required to access it. Compared with a database, this simple storage and access mechanism delivers much better I/O performance. A file system also organizes data in a tree structure, which makes it easy to manage intermediate data by category: each piece of intermediate data stays attached to the application (or module) that produced it and is not exposed to other applications (or modules). When a module is modified or taken offline, its intermediate data can be changed or removed along with it, without the coupling problems caused by data sharing. Similarly, a stored procedure that generates intermediate data can be moved out of the database and become part of the application, eliminating the same kind of coupling.
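As a rough illustration of this file-based approach, the Python sketch below writes a module's intermediate result to a compressed file under a per-application directory and reads it back for further computation. The directory layout, field values, and function names are illustrative assumptions, not a prescribed design.

```python
# A minimal sketch: intermediate data lives in the file system, not in tables.
import csv, gzip, os

def save_intermediate(rows, app, module, name):
    """Write read-only intermediate data under a per-application directory,
    compressed because it never needs to be updated in place."""
    path = os.path.join("intermediate", app, module)
    os.makedirs(path, exist_ok=True)
    file = os.path.join(path, name + ".csv.gz")
    with gzip.open(file, "wt", newline="") as f:
        csv.writer(f).writerows(rows)
    return file

def load_intermediate(file):
    """Read the intermediate data back for further computation; no
    transaction machinery is involved, just sequential file I/O."""
    with gzip.open(file, "rt", newline="") as f:
        return list(csv.reader(f))

# Example: a result produced by one module stays attached to that module.
path = save_intermediate([["2024-01", 1250], ["2024-02", 1430]],
                         app="sales", module="monthly_report", name="totals")
print(load_intermediate(path))
```

Because the data sits under the producing module's own directory, deleting or changing that module means simply deleting or regenerating its files, with no shared tables to untangle.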
Intermediate tables that do not originate from database computations can also be reduced or eliminated. The extract and transform stages of an ETL operation can be handled outside the database by the computing engine, and only the cleaned data is loaded into the database. Since the first two stages consume no database computing resources, no intermediate tables are needed to stage the data; the database only stores the final result (a minimal sketch follows).
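The sketch below shows this division of labor under assumed names: the raw file `raw_orders.csv`, the target table `orders`, and its columns are all hypothetical, and SQLite merely stands in for whatever database is in use.

```python
# Extract and transform run in the application process; only the cleaned
# result is loaded into the database, so no staging tables are created there.
import csv, sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    cleaned = []
    for r in rows:
        if not r["amount"]:          # drop incomplete records during cleansing
            continue
        cleaned.append((r["id"], r["customer"].strip().upper(), float(r["amount"])))
    return cleaned

def load(rows, db="warehouse.db"):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, customer TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("raw_orders.csv")))
```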
A computing engine can also perform the mixed computations needed for data presentation, combining non-database sources with database data, so external data does not have to be loaded into the database, which considerably reduces intermediate tables. Because the engine retrieves data from the source on demand, the presentation always reflects the most recent data, giving better real-time capability; periodically loading data into intermediate tables, by contrast, can miss the latest records. Leaving external data where it is also exploits the strengths of non-database sources: NoSQL databases excel at key-value lookup and handle data of various structures well, and a professional computing engine that handles multi-level data such as XML and JSON can express such computing logic more naturally than a conventional relational database.
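A small sketch of such a mixed computation, under assumed names: recent events kept in a JSON file (standing in for any non-database source) are joined on the fly with reference data queried from the database, and nothing is loaded into the database along the way. The file `events.json`, the `product` table, and the field names are illustrative only.

```python
# Mixed computation for presentation: join external JSON data with a database
# query result inside the application process.
import json, sqlite3

def report(json_path, db_path):
    with open(json_path) as f:
        events = json.load(f)        # ad hoc read: always the latest data

    con = sqlite3.connect(db_path)
    products = dict(con.execute("SELECT id, name FROM product"))
    con.close()

    # The join happens here, not in the database, so no intermediate table
    # is needed to hold the external data.
    return [{"product": products.get(e["product_id"], "unknown"), "qty": e["qty"]}
            for e in events]

print(report("events.json", "warehouse.db"))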
Beyond the essential computing power, a computing engine intended to relieve the database's burden must be open and integration-friendly.
Openness means the computing capability is independent of any particular storage system. An open computing engine can process data from any source, such as the file system, and lets the developer choose a suitable storage plan for organizing and managing intermediate data; a computing system that requires a specific storage mechanism (say, a database) is just the same old thing under a different label. Integrability means the computing procedure is embedded into the application and becomes part of it, rather than running as a separate process shared by multiple applications (or modules), so application coupling does not arise.
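To make the two properties concrete, here is a sketch of what they look like in code: the computation is an in-process library call (integrability) that accepts plain records from any source, whether a file, a JSON document, or a database cursor (openness). The grouping function and sample data are illustrative assumptions.

```python
# Open: the function takes records from any source. Integrable: it runs inside
# the application process as an ordinary library call, not a separate service.
from itertools import groupby
from operator import itemgetter

def group_total(records, key_field, value_field):
    """Group-and-sum over plain records, regardless of where they came from."""
    records = sorted(records, key=itemgetter(key_field))
    return {k: sum(r[value_field] for r in g)
            for k, g in groupby(records, key=itemgetter(key_field))}

# The same call works on rows read from a CSV file, a JSON document, or a
# database cursor wrapped into dicts; the engine imposes no storage choice.
rows = [{"region": "east", "amount": 10},
        {"region": "west", "amount": 7},
        {"region": "east", "amount": 5}]
print(group_total(rows, "region", "amount"))   # {'east': 15, 'west': 7}
```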
Measured against these two criteria, the Hadoop stack (including Spark) is not well suited to serve as such a computing engine, even though it has considerable computing power. It offers a certain degree of openness for computing external data, but the performance is poor and the capability is rarely used. A Hadoop system is heavyweight and runs as an independent process; it has almost no integrability and cannot be embedded into an application.
A truly open and integration-friendly computing engine separates computing power from storage strategy, which makes application architecture design more convenient and flexible. With such an engine, there is no need to deploy an additional database or scale out the existing one just to obtain computing power; the database is left to do what it does best, and resources are put to their best use.