Many large organizations find their databases (or data warehouses) crammed with old data tables after years of operation, sometimes tens of thousands of them. No one remembers why these tables were created, and many have long been useless, yet all of them are kept for fear of deleting something still in use, creating a heavy operation and maintenance workload. Moreover, a large number of stored procedures keep feeding data into these tables, consuming computational resources and turning database scale-out into an endless process.
Does real-world business analytics really need all these tables? Is there any computation so complicated that it involves tens of thousands of tables? An experienced developer would say that is unlikely: a few hundred tables are enough to implement extremely complex logic. Most of those old tables are, in fact, intermediate tables that store no raw data. But where do these intermediate tables come from, and what are they used for? Generally they have two sources: internal and external.
Intermediate tables generated within the database exist to support data presentation (reporting or query). Aggregating directly over an extremely large amount of raw data performs poorly and gives users a bad experience, so the data is pre-aggregated into intermediate results on which the report is then built, and these intermediate results are stored as intermediate tables. At other times the computing logic is so complex that embedding it in report development would make the whole process unmanageable, so the computation is likewise done in advance and its result stored as intermediate tables. These tables are refreshed every time the corresponding stored procedure runs, taking up more and more storage space and consuming computational resources. And because reports are constantly changed and added, the intermediate tables keep accumulating.
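A minimal sketch of this pattern, using Python with SQLite purely for illustration and made-up table and column names (raw_sales, mid_sales_by_region_month): the job of such a pre-aggregation step is simply to materialize a summary as a new table.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical database file

conn.executescript("""
    -- A stand-in for the huge raw fact table (in reality it already exists).
    DROP TABLE IF EXISTS raw_sales;
    CREATE TABLE raw_sales (region TEXT, sale_date TEXT, amount REAL);
    INSERT INTO raw_sales VALUES
        ('EMEA', '2023-01-05', 120.0),
        ('EMEA', '2023-01-20', 80.0),
        ('APAC', '2023-02-03', 200.0);

    -- Pre-aggregate into an intermediate table so the report reads the small
    -- summary instead of scanning the raw data on every refresh.
    DROP TABLE IF EXISTS mid_sales_by_region_month;
    CREATE TABLE mid_sales_by_region_month AS
    SELECT region,
           strftime('%Y-%m', sale_date) AS month,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM raw_sales
    GROUP BY region, strftime('%Y-%m', sale_date);
""")
conn.commit()
```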
Why store these intermediate results as database tables? Because we need the database's computing ability to process them further. Intermediate data can't be presented in a report without a second computation, often a simple one such as parameter-based filtering or a further aggregation, and computations are exactly what a database is good at. Regular files, by contrast, have no computing ability of their own. That is why programmers resort to intermediate tables.
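A hedged sketch of that "second computation", reusing the hypothetical summary table from the previous sketch: the report layer merely pushes a parameter-based filter down to the database.

```python
import sqlite3

def report_rows(region: str):
    """Fetch pre-aggregated rows for one region -- the kind of simple,
    parameter-based filtering a report pushes to the database."""
    conn = sqlite3.connect("warehouse.db")
    cur = conn.execute(
        "SELECT month, total_amount, order_count "
        "FROM mid_sales_by_region_month "
        "WHERE region = ? ORDER BY month",
        (region,),
    )
    return cur.fetchall()

print(report_rows("EMEA"))  # hypothetical report parameter
```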
Intermediate tables that come from outside the database have two causes. The first is ETL. As the name suggests, the operation should be performed in the order of extract, transform, load: the raw data is extracted and transformed first, and only the final result is loaded into the database. But both the extract and transform stages are computation-intensive, and it is inconvenient to perform those computations outside the database. So in practice all the raw data is loaded into the database and extracted and transformed there, turning ETL into ELT or even LET. The loaded raw data is stored as tables in the relational database, creating many extra intermediate tables.
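The following sketch illustrates this ELT pattern (file name, table names and columns are all invented): the raw file is loaded verbatim into a staging table first, the transformation happens inside the database, and the staging table stays behind as yet another intermediate table.

```python
import csv
import sqlite3

# Pretend this CSV arrived from an external system (made-up file and columns).
with open("orders_dump.csv", "w", newline="") as f:
    csv.writer(f).writerows([
        ["order_id", "customer", "amount"],
        ["O-1", "ACME", "99.5"],
        ["O-2", "Initech", ""],   # a dirty row the transform step will drop
    ])

conn = sqlite3.connect("warehouse.db")

# "L" first: dump the raw file verbatim into a staging (intermediate) table.
conn.execute("DROP TABLE IF EXISTS stg_orders_raw")
conn.execute("CREATE TABLE stg_orders_raw (order_id TEXT, customer TEXT, amount TEXT)")
with open("orders_dump.csv", newline="") as f:
    rows = [(r["order_id"], r["customer"], r["amount"]) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO stg_orders_raw VALUES (?, ?, ?)", rows)

# "T" afterwards, inside the database; stg_orders_raw remains as one more
# intermediate table.
conn.execute("DROP TABLE IF EXISTS orders")
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id, customer, CAST(amount AS REAL) AS amount
    FROM stg_orders_raw
    WHERE amount <> ''
""")
conn.commit()
```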
The second cause is heterogeneous data sources in data presentation (reporting or query). Modern applications often produce reports from non-database data. Because it is nearly impossible to process a mix of database and non-database data directly, the external data is generally loaded into the database on a regular basis to be computed together with the existing data, which of course generates more intermediate tables. Sometimes the data downloaded from the web is in JSON or XML format, and multiple associated tables are needed to store it in a relational database. No wonder the intermediate tables keep multiplying.
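For instance, a nested JSON document has to be split across multiple associated tables before a relational database can store it. A rough sketch with invented field and table names:

```python
import json
import sqlite3

# A nested document downloaded from the web (invented structure).
doc = json.loads("""{"order_id": 1001, "customer": "ACME",
                     "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 5}]}""")

conn = sqlite3.connect("warehouse.db")

# One nested document becomes two associated tables: a header table and a
# detail table linked by order_id.
conn.execute("DROP TABLE IF EXISTS ext_order")
conn.execute("DROP TABLE IF EXISTS ext_order_item")
conn.execute("CREATE TABLE ext_order (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE ext_order_item (order_id INTEGER, sku TEXT, qty INTEGER)")
conn.execute("INSERT INTO ext_order VALUES (?, ?)", (doc["order_id"], doc["customer"]))
conn.executemany(
    "INSERT INTO ext_order_item VALUES (?, ?, ?)",
    [(doc["order_id"], item["sku"], item["qty"]) for item in doc["items"]],
)
conn.commit()
```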
In all the situations above, the purpose of creating intermediate tables is to use the database's computing capability; comparable computing power rarely exists outside the database. A database, however, is a closed computational system: it cannot compute data that lies outside it. To access its computing capability, data must first be loaded into it, and intermediate tables are the result.
A closed storage system makes sense for a database, because it ensures data integrity by enforcing constraints on the data it stores. A closed computational system, however, isn't necessary, since computation doesn't differentiate between database data and non-database data. But because the database's computing model is built on its storage model, the former becomes as closed as the latter, and the database grows bloated from the effort of obtaining its computing capability. A bloated database is not only a management problem; given how expensive database storage and computational resources are, scaling out a database or deploying a new one just to obtain computing power isn't economical either.
A bloated database is the result of its closed computational system, but two other technical mechanisms also add to the difficulty of database operation and maintenance. First, a database is an independent process. Its computing power is shared by multiple applications and never belongs to any specific one. Every application can access the database's resources, so the intermediate tables and stored procedures created by one application (or module) can be used by another, creating tight coupling between applications (or modules). Even when the application that produced an intermediate table has been retired, the table can't be deleted, because some other application may still be using it.
Second, database tables are organized in a flat, linear fashion. That is easy to grasp when there aren't many tables, but becomes hard to manage once there are thousands or tens of thousands of them. Such a huge number of items would normally be managed with a multi-level tree structure, which relational databases don't support (the schema concept provides, at most, two levels). The only workaround is to encode the hierarchy in long table names. That strategy is not only inconvenient but also demands considerable development and management discipline, and naming conventions are hard to enforce strictly, especially under deadline pressure. Over time the intermediate tables pile up messily and become hard to identify.
In the end, the closed computational system is the root cause of bloated databases.
Originally posted here.