How Big Is a Terabyte of Data?
By JIANG Buxing
A mile doesn't seem like a long distance, and a cubic mile doesn't seem that big compared with the size of the Earth. Yet you may be surprised to learn that the entire world's population could fit inside a cubic mile of space. Hendrik Willem van Loon, a Dutch-American writer, once made a similar observation in one of his books.
Teradata is a well-known provider of data warehousing products. The brand name was chosen to sound impressive in handling massive amounts of data. That was twenty years ago. Today both users and vendors talk about data in terabytes. Scales of dozens or nearly a hundred terabytes, or even a petabyte, are now so common that a single terabyte seems unremarkable, and several or dozens of terabytes no longer sound intimidating at all.
In fact a terabyte, like a cubic mile, is enormous. Though it is hard to get an intuitive feel for it, we can understand it from two points of view.
First, let’s look at it spatially.
Most data analysis and computation is performed on structured data, of which ever-growing transaction data takes up the largest share. Each transaction record isn't big, from a few dozen bytes to about a hundred bytes. A banking transaction record, for instance, includes account, date, and amount; a telecom call record consists of phone number, time, and duration. If each record occupies 100 bytes, or 0.1 KB, then a terabyte of space holds 10G rows, that is, ten billion records.
What does this mean? There are a little more than 30 million seconds in a year, and to accumulate 1 TB of data in a year requires generating over 300 records per second around the clock.
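As a quick sanity check, here is that arithmetic in a few lines of Python (a minimal sketch; the 100-byte record size and the decimal definition of 1 TB are the assumptions from above):

# Back-of-envelope check of the figures above.
TB = 10 ** 12                     # 1 TB in decimal units (assumption)
record_size = 100                 # bytes per transaction record (assumption)
seconds_per_year = 365 * 24 * 3600

records_per_tb = TB // record_size            # ten billion rows
records_per_second = records_per_tb / seconds_per_year

print(records_per_tb)               # 10000000000
print(round(records_per_second))    # ~317 records every second, around the clock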
That isn't a ridiculously gigantic scale. In a large country like the U.S., national telecom operators, national banks, and internet giants can easily reach it. For city-wide or even some state-wide institutions, however, it is difficult to accumulate 1 TB of data. There is little chance that the tax information collected by a local tax bureau, the purchase data of a local chain store, or the transaction data of a city commercial bank grows by 300 records per second. Moreover, many organizations generate data only during business hours or on weekdays. To reach dozens or even a hundred terabytes of data, the volume of business has to be one or two orders of magnitude bigger still.
A terabyte of data may be too abstract to make sense of on its own, but translating it into a volume of business gives us a clearer picture. The amount of data is closely tied to the technologies a big data analytics and computing product should adopt, so it is crucial for an organization to assess its data volume soberly in order to build a big data platform well.
One terabyte of space does become small if it is filled with unstructured data such as audio and video, or if it is used to back up raw data. But generally we only perform storage management or search on that kind of data. Since there is no need for direct analysis and computation, a big data platform is unnecessary; a network file system is sufficient, which reduces cost considerably.
Now let's look at it in terms of time.
How long does it take to process one terabyte of data? Some vendors claim their products can do it within a few seconds, which is what users expect. But is that possible?
Under an operating system, data can be read from an HDD at about 150MB per second (the figures the hard disk manufacturer quotes are rarely fully achievable). An SSD is faster, roughly double that, at about 300MB per second. Even at that speed, it takes over 3,000 seconds, nearly an hour, to read one terabyte of data without doing anything else with it. How, then, can one terabyte be processed in seconds? Simply by adding more hard disks. With 1,000 hard disks, one terabyte of data can be read in roughly 3 seconds.
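A minimal sketch of the same estimate, assuming the sequential read speeds quoted above and perfectly even parallelism across drives:

# Rough retrieval-time estimate under the stated assumptions
# (~150 MB/s for an HDD, ~300 MB/s for an SSD, ideal scaling across drives).
TB = 10 ** 12
MB = 10 ** 6

def scan_seconds(total_bytes, mb_per_second, disks=1):
    """Time to read total_bytes spread evenly over `disks` drives."""
    return total_bytes / (mb_per_second * MB * disks)

print(scan_seconds(TB, 150))              # ~6667 s on a single HDD
print(scan_seconds(TB, 300))              # ~3333 s on a single SSD, close to an hour
print(scan_seconds(TB, 300, disks=1000))  # ~3.3 s with 1,000 drives in parallel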
That is an idealized estimate. In reality, data is rarely stored in neat, contiguous order (performance drops sharply when discontinuous data is read from a hard disk); a cluster is needed (obviously 1,000 hard disks cannot be installed in one machine), which introduces network latency; some computations, such as grouping with large result sets or sorting, require writing intermediate results back to disk; and access that must finish within a few seconds usually has to serve concurrent requests as well. With all these factors considered, it is not surprising for data retrieval to become several times slower.
Now we realize that a terabyte of data means several hours of retrieval time, or a thousand hard disks. You can imagine what dozens, or a hundred, terabytes of data will bring.
You may think that since hard disks are too slow, we can use memory instead.
Indeed, memory is much faster than a hard disk and better suited to parallel processing. But a machine with a large memory is expensive (and the cost does not grow linearly with capacity). To make matters worse, memory utilization is usually poor. On the commonly used Java-based computing platforms, the JVM's effective memory utilization is only about 20% if no data compression is employed, which means about 5 TB of memory is needed to hold 1 TB of data loaded from disk. That becomes very expensive, since many machines are required.
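The memory sizing works out as follows (a sketch under the assumptions above; the 20% utilization figure is the one quoted in the text, and the 256 GB per node is a hypothetical configuration chosen only for illustration):

# Memory sizing sketch: how much RAM, and roughly how many nodes,
# are needed to hold 1 TB of raw data at ~20% effective JVM utilization.
TB = 10 ** 12
GB = 10 ** 9

data_on_disk = 1 * TB
jvm_utilization = 0.20                            # assumption from the text
memory_needed = data_on_disk / jvm_utilization    # ~5 TB of RAM

node_memory = 256 * GB                            # hypothetical RAM per node
nodes = memory_needed / node_memory

print(memory_needed / TB)    # 5.0 TB of memory in total
print(round(nodes))          # ~20 such nodes just to hold the data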
With this sense of what 1 TB of data really is, we can quickly form a fairly good idea of the type of business, the number of nodes, and the deployment cost whenever we encounter multi-terabyte data, and we won't be misled when planning a computing platform or choosing a product. Even today, the name Teradata still carries a vivid meaning.