The evolution of Big Data technologies over the last 20 years has been a history of battles with growing data volume. The challenge of big data has not been solved yet, and the effort will certainly continue, with data volume continuing to grow in the coming years. The traditional relational database management system (RDBMS) and the associated OLTP (Online Transaction Processing) workloads make it easy to work with data using SQL in every respect, as long as the data size is small enough to manage. However, once the data reaches a significant volume, it becomes very difficult to work with, because it takes a long time, or is sometimes even impossible, to read, write, and process it successfully.
Overall, dealing with a large amount of data is a universal problem for data engineers and data scientists. The problem has driven many new technologies (Hadoop, NoSQL databases, Spark, etc.) that have bloomed in the last decade, and this trend will continue. This article is dedicated to the main principles to keep in mind when you design and implement a data-intensive process over a large data volume, whether that is preparing data for your machine learning applications or pulling data from multiple sources to generate reports or dashboards for your customers.
The essential problem of dealing with big data is, in fact, a resource issue: the larger the volume of data, the more resources are required, in terms of memory, processors, and disks. The goal of performance optimization is either to reduce resource usage or to utilize the available resources more fully, so that it takes less time to read, write, or process the data. The ultimate objectives of any optimization should include:
- Maximized use of the available memory
- Reduced disk I/O
- Minimized data transfer over the network
- Parallel processing to fully leverage multiple processors
Therefore, when working on big data performance, a good architect is not only a programmer but also possesses good knowledge of server architecture and database systems. With these objectives in mind, let's look at 4 key principles for designing or optimizing your data processes or applications, no matter which tool, programming language, or framework you use.
Principle 1: Design based on your data volume
Before you start to build any data process, you need to know the data volume you are working with: what the data volume will be to start with, and what it will grow into. If the data size will always be small, design and implementation can be much more straightforward and faster. If the data start out large, or start small but will grow fast, the design needs to take performance optimization into consideration. Applications and processes that perform well for big data usually incur too much overhead for small data and slow the process down. On the other hand, an application designed for small data would take too long to complete on big data. In other words, an application or process should be designed differently for small data than for big data. The reasons are listed in detail below:
- Because it is time-consuming to process large datasets end to end, more breakdowns and checkpoints are required along the way. The goal is twofold: first, to allow one to check intermediate results or raise an exception earlier in the process, before the whole process ends; second, if a job fails, to allow restarting from the last successful checkpoint rather than from the beginning, which is far more expensive (a minimal sketch of this pattern follows the list). For small data, on the contrary, it is usually more efficient to execute all steps in one shot because of the short running time.
- When working with small data, the impact of any inefficiency in the process also tends to be small, but the same inefficiency can become a major resource issue for large datasets.
- Parallel processing and data partitioning (see below) not only require extra design and development time to implement, but also take more resources at run time, and therefore should be skipped for small data.
- When working with large data, performance testing should be included in the unit testing; this is usually not a concern for small data.
- Processing for small data can complete quickly with the available hardware, while the same process can fail when processing a large amount of data due to running out of memory or disk space.
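To make the checkpoint idea from the first bullet concrete, here is a minimal sketch in Python. It assumes a pipeline that has been broken into named steps, each writing its own intermediate output; the directory, file names, and step functions are hypothetical and only illustrate the restart-from-last-checkpoint pattern.

```python
import os

CHECKPOINT_DIR = "checkpoints"  # hypothetical location for intermediate results
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def checkpointed(name, step_fn):
    """Run step_fn only if its checkpoint is missing; otherwise reuse the output.

    A failed pipeline can be restarted and will skip every step that already
    completed, resuming from the last successful checkpoint.
    """
    path = os.path.join(CHECKPOINT_DIR, f"{name}.parquet")
    if os.path.exists(path):
        print(f"[skip] {name}: reusing {path}")
        return path
    step_fn(path)                      # the step writes its own output to `path`
    print(f"[done] {name}")
    return path

# Hypothetical usage: each step reads the previous step's checkpoint.
# extract_path = checkpointed("extract", lambda p: extract_raw_data(p))
# clean_path   = checkpointed("clean",   lambda p: clean_data(extract_path, p))
# report_path  = checkpointed("report",  lambda p: build_report(clean_path, p))
```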
The bottom line is that the same process design cannot be used for both small data and large data processing. Large data processing requires a different mindset, prior experience of working with large data volumes, and additional effort in the initial design, implementation, and testing. On the other hand, do not assume "one-size-fits-all" and apply a design intended for big data everywhere, because it could hurt the performance of small data.
Principle 2: Reduce data volume earlier in the process
When working with large datasets, reducing the data size early in the process is always the most effective way to achieve good performance. There is no silver bullet for the big data problem, no matter how much hardware and how many resources you throw at it, so always try to reduce the data size before starting the real work. There are many ways to achieve this, depending on the use case. Some common techniques, among many others, are listed below:
- Do not consume storage (e.g., space in a fixed-length field) when a field has a NULL value.
- Choose data types economically. For example, if a number is never negative, use an unsigned integer type rather than a signed one; if the values have no decimals, use an integer type instead of float.
- Encode text data as unique integer identifiers, because text fields take much more space and should be avoided in processing.
- Data aggregation is always an effective method to reduce data volume when the lower granularity of the data is not needed.
- Compress data whenever possible.
- Reduce the number of fields: read and carry over only those fields that are truly needed.
- Leverage complex data structures to reduce data duplication. One example is to use an array structure to store the repeated values of a field within the same record, instead of one record per value, when those values share many other common key fields.
I hope the above list gives you some ideas about how to reduce the data volume. In fact, the same techniques are used in much database software and in IoT edge computing. The better you understand the data and the business logic, the more creative you can be when trying to reduce the size of the data before working with it. The end result is a process that works much more efficiently with the available memory, disk, and processors.
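As an illustration of several of these techniques together, below is a minimal pandas sketch: it prunes columns at read time, downcasts numeric types, encodes a text field as integer category codes, aggregates early, and writes a compressed columnar output. The file and column names are hypothetical, and the parquet output assumes pyarrow is installed.

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
# Read only the fields that are truly needed (column pruning), from a compressed file.
df = pd.read_csv(
    "transactions.csv.gz",
    usecols=["user_id", "country", "amount", "ts"],
    parse_dates=["ts"],
)

# Choose data types economically: downcast numbers, encode text as integer codes.
df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")
df["amount"] = pd.to_numeric(df["amount"], downcast="float")
df["country"] = df["country"].astype("category")   # text stored once; rows hold integer codes

# Aggregate early when the row-level detail is not needed downstream.
df["month"] = df["ts"].dt.to_period("M").dt.to_timestamp()
monthly = (
    df.groupby(["month", "country"], observed=True, as_index=False)["amount"]
      .sum()
      .rename(columns={"amount": "monthly_amount"})
)

# Write the reduced dataset in a compressed columnar format.
monthly.to_parquet("monthly_by_country.parquet", compression="snappy")
```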
Principle 3: Partition the data properly based on processing logic
Enabling data parallelism is the most effective way to process data fast. As the data volume grows, the number of parallel processes grows with it, so adding more hardware scales the overall data process without the need to change the code. For data engineers, a common method is data partitioning. There are many details of data partitioning techniques that are beyond the scope of this article. Generally speaking, effective partitioning should lead to the following results:
- Allow the downstream data processing steps, such as joins and aggregations, to happen within the same partition. For example, partitioning by time period is usually a good idea if the data processing logic is self-contained within a month.
- The partitions should be roughly even in size, so that each partition takes about the same amount of time to process.
- As the data volume grows, the number of partitions should increase, while the processing programs and logic stay the same.
Also, consider changing the partition strategy at different stages of processing, depending on the operations that need to be performed on the data. For example, when processing user data, hash partitioning on the User ID is an effective way of partitioning. Then, when processing users' transactions, partitioning by time period, such as month or week, can make the aggregation process a lot faster and more scalable.
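As a rough illustration of switching partition strategies between stages, here is a PySpark sketch. The dataset paths, column names, and partition count are hypothetical; the point is only that the first stage partitions by User ID for the join, while the second repartitions by month for the aggregation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

# Hypothetical datasets: per-user attributes and per-user transactions.
users = spark.read.parquet("s3://bucket/users/")
txns = spark.read.parquet("s3://bucket/transactions/")

# Stage 1: user-level processing. Hash-partition both sides on user_id so the
# join happens within matching partitions.
users_p = users.repartition(200, "user_id")
txns_p = txns.repartition(200, "user_id")
enriched = txns_p.join(users_p, on="user_id", how="left")

# Stage 2: time-based aggregation. Switch the partitioning strategy to month,
# so each month's aggregation is self-contained within a partition.
by_month = (
    enriched.withColumn("month", F.date_trunc("month", "txn_ts"))
            .repartition("month")
            .groupBy("month", "country")
            .agg(F.sum("amount").alias("total_amount"))
)

# As the volume grows, only the partition count needs to change, not the logic.
by_month.write.mode("overwrite").partitionBy("month").parquet("s3://bucket/monthly/")
```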
Hadoop and Spark store data in data blocks by default, which enables parallel processing natively without programmers having to manage it themselves. However, because these frameworks are very generic and treat all data blocks the same way, they prevent the finer control that an experienced data engineer could exercise in his or her own program. Therefore, knowing the principles stated in this article will help you optimize process performance based on what is available and on the tools or software you are using.
Principle 4: Avoid unnecessary resource-expensive processing steps whenever possible
As stated in Principle 1, designing a process for big data is very different from designing for small data. An important aspect of the design is to avoid unnecessary, resource-expensive operations whenever possible. This requires highly skilled data engineers with not just a good understanding of how the software works with the operating system and the available hardware resources, but also comprehensive knowledge of the data and the business use cases. In this article, I focus only on the two most resource-expensive aspects to watch for when making a data process more efficient: data sorting and disk I/O.
Putting data records in a certain order is often needed when 1) joining with another dataset, 2) aggregating, 3) scanning, or 4) deduplicating, among other things. However, sorting is one of the most expensive operations: it requires memory and processors, and spills to disk when the input dataset is much larger than the available memory. To get good performance, it is important to be very frugal about sorting, following these principles:
- Do not sort again if the data is already sorted in the upstream or source system.
- Usually, a join of two datasets requires both datasets to be sorted and then merged. When joining a large dataset with a small dataset, convert the small dataset into a hash lookup instead; this avoids sorting the large dataset (see the sketch after this list).
- Sort only after the data size has been reduced (Principle 2) and only within a partition (Principle 3).
- Design the process so that the steps requiring the same sort order sit together in one place, to avoid re-sorting.
- Use an efficient sorting algorithm (e.g., merge sort or quick sort).
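To illustrate the hash-lookup alternative to sorting both sides of a join (second bullet), here is a small pure-Python sketch. The file names and columns are hypothetical; the small lookup table is loaded into a dictionary, and the large file is streamed through once in whatever order it happens to be in, with no sorting.

```python
import csv

# Hypothetical files: a large transactions file and a small country-code lookup.
# Load the small side into a hash table instead of sorting both inputs.
with open("country_codes.csv", newline="") as f:
    lookup = {row["code"]: row["country_name"] for row in csv.DictReader(f)}

with open("transactions.csv", newline="") as src, \
     open("transactions_enriched.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["country_name"])
    writer.writeheader()
    for row in reader:                 # single pass; the large file is never sorted
        row["country_name"] = lookup.get(row["country_code"], "unknown")
        writer.writerow(row)
```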
Another commonly considered factor is reducing disk I/O. There are many techniques in this area, which are beyond the scope of this article; three common ones to consider are listed below:
- Data compression
- Data indexing
- Performing multiple processing steps in memory before writing to disk
Data compression is a must when working with big data, because it allows faster reads and writes, as well as faster network transfer. Data file indexing is needed for fast data access, but at the expense of slower writes; index a table or file only when necessary, keeping in mind the impact on writing performance. Lastly, perform multiple processing steps in memory whenever possible before writing the output to disk. This technique is not only used in Spark, but also in many database technologies.
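The following pandas sketch ties these ideas together: it reads a compressed columnar file, chains several processing steps in memory, and writes the result once with compression, rather than persisting an intermediate file after every step. The file and column names are hypothetical, and parquet support assumes pyarrow is installed.

```python
import pandas as pd

# Hypothetical compressed, columnar input; column names are for illustration only.
df = pd.read_parquet("events.parquet")

# Chain several steps in memory and write to disk only once at the end,
# instead of persisting an intermediate file after every step.
result = (
    df.loc[df["status"] == "ok", ["user_id", "event_type", "duration_ms"]]
      .assign(duration_s=lambda d: d["duration_ms"] / 1000.0)
      .groupby(["user_id", "event_type"], as_index=False)["duration_s"]
      .mean()
)

# A single compressed write reduces disk I/O and speeds up later reads.
result.to_parquet("avg_duration_by_event.parquet", compression="snappy")
```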
In summary, designing big data processes and systems with good performance is a challenging task. The 4 basic principles illustrated in this article will give you guidelines for thinking both proactively and creatively when working with big data, as well as with other databases or systems. It often happens that the initial design does not deliver the best performance, primarily because of the limited hardware and data volume in the development and test environments. Multiple iterations of performance optimization are therefore required after the process runs in production. Furthermore, an optimized data process is often tailored to certain business use cases. When the process is enhanced with new features to satisfy new use cases, certain optimizations could become invalid and require rethinking. All in all, improving the performance of big data processes is a never-ending task, which will continue to evolve together with the growth of the data and the continued effort of discovering and realizing the value of the data.