To SQL or not To SQL: that’s the question!
Lemahieu W., vanden Broucke S., Baesens B.
This article is based upon our upcoming book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com See also our corresponding YouTube channel with free video lectures : https://www.youtube.com/watch?v=o36Z_OqC2ac&list=PLdQddgMBv5zHc…
Relational database systems (RDBMSs) pay a lot of attention to data consistency and compliance with a formal database schema. New data or modifications to existing data are not accepted unless they satisfy constraints represented in this schema in terms of data types, referential integrity etc. The way in which RDBMSs coordinate their transactions guarantees that the entire database is consistent at all times, the well-known ACID properties: atomicity, consistency, isolation and durability. Consistency is usually a desirable property; one normally wouldn’t want for erroneous data to enter the system, nor for e.g., a money transfer to be aborted halfway, with only one of the two accounts updated.
Yet, sometimes this focus on consistency may become a burden, because it induces (sometimes unnecessary) overhead and hampers scalability and flexibility. RDBMSs are at their best when performing intensive read/write operations on small or medium sized data sets or when executing larger batch processes, but with only a limited number of simultaneous transactions. As the data volumes or the number of parallel transactions increase, capacity can be increased by vertical scaling (also called scaling up), i.e. by extending storage capacity and/or CPU power of the database server. However, obviously, there are hardware induced limitations to vertical scaling.
Therefore, further capacity increases need to be realized by horizontal scaling (also known as scaling out), with multiple DBMS servers being arranged in a cluster. The respective nodes in the cluster can balance workloads among one another and scaling is achieved by adding nodes to the cluster, rather than extending the capacity of individual nodes. Such a clustered architecture is an essential prerequisite to cope with the enormous demands of recent evolutions such as big data (analytics), cloud computing and all kinds of responsive web applications. It provides the necessary performance, which cannot be realized by a single server, but also guarantees availability, with data being replicated over multiple nodes and other nodes taking over their neighbor’s workload if one node fails.
However, RDBMSs are not good at extensive horizontal scaling. Their approach towards transaction management and their urge to keep data consistent at all times, induces a large coordination overhead as the number of nodes increases. In addition, the rich querying functionality may be overkill in many big data settings, where applications merely need high capacity to ‘put’ and ‘get’ data items, with no demand for complex data interrelationships nor selection criteria. Also, big data settings often focus on semi-structured data or on data with a very volatile structure (think for instance about sensor data, images, audio data, and so on), where the rigid database schemas of RDBMSs are a source of inflexibility.
None of this means that relational databases will become obsolete soon. However, the ‘one size fits all’ era, where RDBMSs were used in nearly any data and processing context, seems to have come to an end. RDBMSs are still the way to go when storing up to medium-sized volumes of highly structured data, with strong emphasis on consistency and extensive querying facilities. Where massive volumes, flexible data structures, scalability and availability are more important, other systems may be called for. This need resulted in the emergence of NoSQL databases.
The Emergence of the NoSQL Movement
The term “NoSQL” has become overloaded throughout the past decade, so the moniker now relates to many meanings and systems. The modern NoSQL movement describes databases that store and manipulate data in other formats than tabular relations, i.e. non-relational databases. The movement should have more appropriately been called NoREL, especially since some of these non-relational databases actually provide query language facilities close to SQL. Because of such reasons, people have changed the original meaning of the NoSQL movement to stand for “not only SQL” or “not relational” instead of “not SQL”.
What makes NoSQL databases different from other, legacy, non-relational systems which have existed since as early as the 1970s? The renewed interest in non-relational database systems stems from Web 2.0 companies in the early 2000s. Around this period, up-and-coming web companies, such as Facebook, Google, and Amazon were increasingly being confronted with huge amounts of data to be processed, often under time-sensitive constraints. For example, think about an instantaneous Google search query, or thousands of users accessing Amazon product pages or Facebook profiles simultaneously.
Often rooted in the open source community, the characteristics of the systems developed to deal with these requirements are very diverse. However, their common ground is that they try to avoid, at least to some extent, the shortcomings of RDBMSs in this respect. Many aim at near linear horizontal scalability, which is achieved by distributing data over a cluster of database nodes for the sake of performance (parallelism and load balancing) and availability (replication and failover management). A certain measure of data consistency is often sacrificed in return. A term frequently used in this respect is eventual consistency; the data, and respective replicas of the same data item, will become consistent in time after each transaction, but continuous consistency is not guaranteed.
The relational data model is cast aside for other modelling paradigms, which are typically less rigid and better able to cope with quickly evolving data structures. Often, the API (Application Programming Interface) and/or query mechanism are much simpler than in a relational setting. The Comparison Box provides a more detailed comparison of the typical characteristics of NoSQL databases against those of relational systems. Note that different categories of NoSQL databases exist and that even the members of a single category can be very diverse. No single NoSQL system will exhibit all of these properties.
COMPARISON BOX
Relational Databases |
NoSQL Databases |
|
Data paradigm |
Relational tables |
Key-value (tuple) based |
Distribution |
Single-node and distributed |
Mainly distributed |
Scalability |
Vertical scaling, harder to scale horizontally |
Easy to scale horizontally, easy data replication |
Openness |
Closed and open source |
Mainly open source |
Schema role |
Schema-driven |
Mainly schema-free or flexible schema |
Query language |
SQL as query language |
No or simple querying facilities, or special-purpose languages |
Transaction mechanism |
ACID: Atomicity, Consistency, Isolation, Durability |
BASE: Basically available, Soft state, Eventually consistent |
Feature set |
Many features (triggers, views, stored procedures, etc.) |
Simple API |
Data volume |
Capable of handling normal-sized data sets |
Capable of handling huge amounts of data and/or very high frequencies of read/write requests |
Characteristics of NoSQL databases versus relational databases.
We note, however, that the explosion of popularity of NoSQL data storage layers should be put in perspective considering their limitations. Most NoSQL implementations have yet to prove their true worth in the field (most are very young and in development). Most implementations sacrifice ACID concerns in favor of being eventually consistent, and the lack of relational support makes expressing some queries or aggregations particularly difficult, with map-reduce interfaces being offered as a possible, but harder to learn and use, alternative.
Combined with the fact that RDBMSs do provide strong support for transactionality, durability and manageability, quite a few early adopters of NoSQL were confronted with some sour lessons. See for instance the FreeBSD maintainers speaking out against MongoDB’s lack of on-disk consistency support[1], Digg struggling with the NoSQL Cassandra database after switching from MySQL[2] and Twitter facing similar issues as well (which also ended up sticking with a MySQL cluster for a while longer)[3], or the fiasco of HealthCare.gov, where the IT team also went with a badly-suited NoSQL database[4]. It would be an over-simplification to reduce the choice between RDBMSs and NoSQL databases to a choice between consistency and integrity on the one hand, and scalability and flexibility on the other. The market of NoSQL systems is far too diverse for that. Still, this tradeoff will often come into play when deciding on taking the NoSQL route. We see many NoSQL vendors focusing again on robustness and durability. We also observe traditional RDBMS vendors implementing features that let you build schema-free, scalable data stores inside a traditional RDBMS, capable to store nested, semi-structured documents, as this seems to remain the true selling point of most NoSQL databases, especially those in the document store category. Some vendors have already adopted “NewSQL” as a new term to describe modern relational database management systems that aim to blend the scalable performance and flexibility of NoSQL systems with the robustness guarantees of a traditional DBMS.
Expect the future trend to continue towards adoption of such “blended systems”, except for use cases that require specialized, niche database management systems. In these settings, the NoSQL movement has rightly taught users that the one size fits all mentality of relational systems is no longer applicable and should be replaced by finding the right tool for the job. For instance, graph databases arise as being “hyper-relational” databases, which makes relations first class citizens next to records themselves rather than doing away with them altogether. Graph databases express complicated queries in a straightforward way, especially where one must deal with many, nested, or hierarchical relations between objects. The below table concludes this article by summarizing the differences between traditional RDBMSs, NoSQL DBMSs and NewSQL DBMSs.
Traditional SQL RDBMSs |
NoSQL databases |
Blended systems, “NewSQL” |
|
Relational |
Yes |
No |
Yes |
SQL |
Yes |
No, though can come with own query languages |
Yes |
Column stores |
No |
Yes |
Yes |
Scalability |
Limited |
Yes |
Yes |
Consistency model |
Strong |
Eventually consistent, though some efforts to enforce stronger consistency |
Strongly consistent for the most part |
BASE (Basically Available, Soft state and Eventually consistent) |
No |
Yes |
No |
Handles large (big) volumes of data |
No |
Yes |
Yes |
Schema-less |
No |
Yes |
No, though can store and query free-structured fields |
For more information, we are happy to refer to our book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com and YouTube channel: https://www.youtube.com/watch?v=o36Z_OqC2ac&list=PLdQddgMBv5zHc…
Wilfried Lemahieu is a professor at KU Leuven, Faculty of Economics and Business, where he also holds the position of Dean. His teaching, for which he was awarded a ‘best teacher recognition’ includes Database Management, Enterprise Information Management and Management Informatics. His research focuses on big data storage and integration, data quality, business process management and service-oriented architectures. In this context, he collaborates extensively with a variety of industry partners, both local and international. His research is published in renowned international journals and he is a frequent lecturer for both academic and industry audiences. See feb.kuleuven.be/wilfried.lemahieu for further details.
Bart Baesens is a professor of Big Data and Analytics at KU Leuven (Belgium) and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on Big Data & Analytics and Credit Risk Modeling. He wrote more than 200 scientific papers some of which have been published in well-known international journals and presented at international top conferences. He received various best paper and best speaker awards. Bart is the author of 8 books: Credit Risk Management: Basic Concepts (Oxford University Press, 2009), Analytics in a Big Data World (Wiley, 2014), Beginning Java Programming (Wiley, 2015), Fraud Analytics using Descriptive, Predictive and Social Network Techniques (Wiley, 2015), Credit Risk Analytics (Wiley, 2016), Profit-Driven Business Analytics (Wiley, 2017), Practical Web Scraping for Data Science (Apress, 2018) and Principles of Database Management (Cambridge University Press, 2018). He sold more than 20.000 copies of these books worldwide, some of which have been translated in Chinese, Russian and Korean. His research is summarized at www.dataminingapps.com.
Seppe vanden Broucke works as an assistant professor at the Faculty of Economics and Business, KU Leuven, Belgium. His research interests include business data mining and analytics, machine learning, process management and process mining. His work has been published in well-known international journals and presented at top conferences. He is also author of the book Beginning Java Programming (Wiley, 2015) of which more than 4000 copies were sold and which was also translated in Russian. Seppe’s teaching includes Advanced Analytics, Big Data and Information Management courses. He also frequently teaches for industry and business audiences. See seppe.net for further details.
[1] http://www.ivoras.net/blog/tree/2009-11-05.a-short-time-with-mongod…
[2] http://www.forbes.com/2010/09/21/cassandra-mysql-software-technolog…
[3] https://techcrunch.com/2010/07/09/twitter-analytics-mysql/
[4] https://gigaom.com/2013/11/25/how-the-use-of-a-nosql-database-playe…