SQL (or the stored procedure) can handle most of the database computations. If the computations are complex or hard to deal in SQL, we use another programming language to read data out of the database to manipulate it. Such a programming language handles the data read and manipulation with a simple script. So we call the process the post-SQL scripting.
The scenarios that SQL is not good at handling include complex set-based operations, order-based operations, associative operations and multi-step computations, etc. Due to SQL’s incomplete set orientation and lack of explicit set data type, it’s almost impossible to reuse the intermediate sets generated during the computation. The forced aggregate after each grouping operation makes it impossible to use the post-grouping subsets. The unordered-set-based SQL adopts a very roundabout way to handle order-based computations, like inter-row (group) computations and ranking operations. The language generates temporary sequence numbers using JOIN(s) or subqueries, making the program hard to write and slow to compute. Record reference is another SQL’s incapability. The language uses a subquery or a JOIN statement to express the relationship. Code becomes ridiculously complicated when there are multiple data levels or when self-joins are needed. SQL doesn’t foster multi-step coding. Programmers have to write very long query containing layers of subqueries. Though stored procedures can alleviate the problem, they are not always available. DBA has strict rules about the privileges of using stored procedures, and old and small databases don’t support stored procedures. Besides, it’s inconvenient to debug a stored procedure. This makes it unsuitable to do a procedural computation.
There are other scenarios that require post-SQL scripting. To migrate the algorithm between different database products or between database and non-relational database, data source or output target isn’t the database but a file, and mixed computation performed between multiple databases, for example. All these external database computations need post-SQL scripting.
The most important role of a post-SQL script is to achieve the complex computations that SQL is not good at. It would be better if they had some advanced features, such as the ability of processing data of various sources (files and non-relational databases, for example) and handling a relatively large amount of data, and satisfactory performance. But, the basic thing is that the scripting tool should be able to perform database read/write conveniently.
The popular post-SQL scripting tools are Java, Python pandas and esProc. Now let’s look at and examine their scripting abilities.
JAVA
High-level languages, such C++ and Java, are theoretically almighty and thus are able to manage computations SQL is hard to deal with. Java supports generic type and has relatively comprehensive set orientation to handle complex set-based computations. A Java array has intrinsic sequence numbers to implement order-based computations conveniently. Java can express a relationship using object reference and handle join operations well. It also supports procedural syntax, including branch and loop, to achieve complex multi-step computations.
Unfortunately, Java lacks class libraries for structured computations. It hardcodes even the simplest structured computations and creates the most basic data type manually. That makes Java code lengthy and complicated.
Here’s an example of order-based computation: get the number of longest consecutively rising trading days for a stock. Database AAPL stores a stock’s price information in fields including transaction dates and closing price.
In an intuitive way, we loop through the records ordered by dates, add 1 to the number of consecutive rising days (the initial value is 0) if the closing price of the current record is greater than that of the previous one; and compare the number of consecutively rising days with the current largest number of consecutively rising days (the initial value is 0) to get the new largest number and reset the number as 0. Repeat the process until the loop is over and the current largest number is the final desired one.
SQL can’t implement the intuitive algorithm because it doesn’t support order-based computations. But it has its own tricks. It groups stock records ordered by dates. The way is like this: put records whose closing prices rise consecutively in one group, that is, group the current record with the previous one if its closing price rises and create a new group if its price lowers; then count the members in every group and find the largest number, which is the largest number of consecutively rising days.
Examples of such SQL code can be found here.