How large must a data set be to count as big data? For some, a slightly larger Excel spreadsheet is already “big data”. Fortunately, a few characteristics allow us to describe big data fairly well.
According to IBM, 90% of the data that exists worldwide today was created in the last 2 years alone. Big Data analysis in healthcare could be helpful in many ways. For example, such analyses may help counteract the spread of diseases and optimize the needs-based supply of medicinal products and medical devices.
In this article, we will define what Big Data is and discuss ways it could be applied in healthcare.
Big data definition
The simplest way to put it: big data is data that can no longer be processed by a single computer. It is so large that it has to be stored and processed piece by piece on several servers.
A short definition can also be expressed by three Vs:
- Volume – describes the size of the data
- Variety – the diversity of data formats and sources
- Velocity – the speed at which data arrives and is stored
Volume – The Size of Data
As noted above, big data is most easily described by its sheer volume and complexity. These properties make it impossible to store or process big data on just one computer. For this reason, such data is stored and processed in specially developed software ecosystems, such as Hadoop.
Variety – Data Diversity
Big data is very diverse: it can be structured, unstructured or semi-structured.
It also usually comes from different sources. For example, a bank could store transfer data from its customers, but also recordings of telephone conversations made by its customer support staff.
In principle, it makes sense to save data in the format in which it was recorded, and the Hadoop framework enables companies to do just that.
With Hadoop, there is no need to convert customer call data into text files. The calls can be saved directly as audio files. The trade-off is that conventional database structures can then no longer be used for this data.
Velocity – The Speed of Data
Velocity is about the speed at which data is generated and saved.
Data often has to be stored and processed in real time. This is what makes it possible for companies like Zalando or Netflix to offer their customers product recommendations in real time.
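To make the idea of real-time processing a little more concrete, here is a minimal sketch that updates a "trending items" list as each event arrives, instead of recomputing it from stored data. The class `TrendingItems`, the window size, and the sample events are all hypothetical and chosen only for illustration; this is not how Zalando or Netflix build their recommendations.

```python
from collections import Counter, deque


class TrendingItems:
    """Hypothetical helper: counts items over the most recent `window` events."""

    def __init__(self, window):
        self.window = window
        self.events = deque()    # the most recent events, oldest first
        self.counts = Counter()  # how often each item occurs in the window

    def add(self, item):
        # Update the state incrementally as soon as an event arrives.
        self.events.append(item)
        self.counts[item] += 1
        if len(self.events) > self.window:
            oldest = self.events.popleft()
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]

    def top(self, k):
        # The k most frequent items in the current window.
        return [item for item, _ in self.counts.most_common(k)]


trending = TrendingItems(window=4)
for viewed in ["shoes", "jacket", "shoes", "hat", "shoes", "hat"]:
    trending.add(viewed)

print(trending.top(2))
```

The point of the design is that every new event costs a constant amount of work, so the result is always up to date without a batch job.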
Big Data Implementation in Healthcare
There are three obvious yet fundamentally transformative ways to use Big Data coupled with artificial intelligence.
- The first is monitoring. In the future, significant deviations in vital signs will be flagged automatically: is the increased pulse a normal consequence of the staircase just climbed? Or, in combination with other data and the patient's history, does it point to cardiovascular disease? In this way, diseases can be detected in their early stages and treated effectively.
- The second is diagnosis. Today it depends almost exclusively on the doctor's knowledge and analytical capacity whether, for example, a cancer metastasis on an X-ray image is recognized as such. In the future, the doctor will use artificially intelligent systems that, thanks to Big Data technology, become a little smarter with each X-ray image analyzed. The probability of error in the diagnosis decreases, and the accuracy of the subsequent treatment increases.
- Third, Big Data and artificial intelligence have the potential to make the search for new medicines and other treatment methods much more efficient. Today, countless molecular combinations must be tested for effectiveness, first in the Petri dish, then in animal experiments, and finally in clinical trials, before perhaps yielding a new drug at the end. It is a billion-dollar game of roulette, in which the chances of winning can be significantly increased by computer-aided forecasting methods that draw on an unprecedented wealth of research data.
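The monitoring case from the first point can be sketched in a few lines. The example below compares a pulse reading with a person's resting baseline and only flags it when the deviation is large and not explained by recent activity. The function `flag_pulse`, the threshold of three standard deviations, and the sample values are assumptions made for illustration; they are not clinical guidance, and a real system would combine far more data.

```python
from statistics import mean, stdev


def flag_pulse(baseline, reading, recently_active, threshold=3.0):
    """Hypothetical check: True if `reading` is unusually high for this person.

    `baseline` is a list of resting pulse values; `threshold` is the number
    of standard deviations above the mean that counts as a deviation
    (an illustrative value, not a medical one).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return False  # no variation in the baseline, nothing to compare with
    z_score = (reading - mu) / sigma
    # A high pulse right after climbing stairs is expected, so it is not flagged.
    return z_score > threshold and not recently_active


resting = [62, 64, 61, 65, 63, 62, 64]
print(flag_pulse(resting, 110, recently_active=True))   # explained by activity
print(flag_pulse(resting, 110, recently_active=False))  # unexplained, flagged
print(flag_pulse(resting, 64, recently_active=False))   # within normal range
```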
As with every innovation in the health system, this is about people's hopes for a longer and healthier life, and about the fear of being torn from life prematurely by cancer, heart attack, stroke, or another insidious disease.
If you want to examine the case of Big Data in practice, you can check the article “Big Data in the Healthcare Industry”.
Technology Stack
Apache Hadoop Framework
To meet these special properties and requirements of big data, the open-source Hadoop framework was designed. It basically consists of two components:
HDFS
First, it stores data on several servers (in clusters) in the so-called HDFS (Hadoop Distributed File System). Second, it processes this data directly on the servers without downloading it to a single computer: Hadoop processes the data where it is stored. This is done using a programming model called MapReduce.
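The idea of storing one file across several servers can be sketched as follows: the file is cut into fixed-size blocks, and each block is copied to more than one node. The functions `split_into_blocks` and `place_blocks`, the node names, and the tiny block size are hypothetical. Real HDFS uses much larger blocks (128 MB by default) and a rack-aware placement policy, so the round-robin placement here is a deliberate simplification.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes, round-robin.

    Simplified stand-in for a real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


data = b"patient-record;" * 10                   # 150 bytes of sample data
blocks = split_into_blocks(data, block_size=64)  # block sizes: 64, 64, 22
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(len(blocks), nodes, replication=3)

for block, holders in placement.items():
    print(block, len(blocks[block]), holders)
```

Because every block lives on three nodes in this sketch, losing a single node does not lose data, and work on a block can run on any of the nodes that already hold it.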
MapReduce
MapReduce processes the data in parallel on the servers, in two steps: first, smaller programs, so-called “mappers”, are used. Mappers sort the data according to categories. In the second step, so-called “reducers” process the categorized data and calculate the results.
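The two steps can be illustrated with a small, single-process sketch in plain Python. It counts records per diagnosis: the mapper emits a category for each record, the pairs are grouped by category, and the reducer sums each group. The function names and the sample records are made up for this example. In Hadoop the grouping between the two steps is handled by the framework, and mappers and reducers run in parallel on many servers rather than in one loop.

```python
from collections import defaultdict


def mapper(record):
    # Emit one (category, 1) pair per record; the category is the diagnosis.
    yield record["diagnosis"], 1


def group_by_key(pairs):
    # Collect all values that share a key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    # Calculate the result for one category.
    return key, sum(values)


# Hypothetical sample records.
records = [
    {"patient": 1, "diagnosis": "flu"},
    {"patient": 2, "diagnosis": "asthma"},
    {"patient": 3, "diagnosis": "flu"},
    {"patient": 4, "diagnosis": "diabetes"},
    {"patient": 5, "diagnosis": "flu"},
]

mapped = [pair for record in records for pair in mapper(record)]
results = dict(reducer(key, values) for key, values in group_by_key(mapped).items())
print(results)
```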
Hive
Working with MapReduce requires programming knowledge. To lower this barrier, another layer was built on top of the Hadoop framework: Hive. Hive does not require writing MapReduce programs and builds on HDFS and MapReduce. Its commands are reminiscent of the commands in SQL, a standard language for database applications, and are translated into MapReduce only in a second step.
The disadvantage: it takes a little more time because the queries still have to be translated into MapReduce.
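To show what such an SQL-like command looks like, the sketch below runs a count per category, the kind of job that would otherwise need a mapper and a reducer, as a single declarative query. It uses Python's built-in `sqlite3` module purely as a stand-in so that the example runs anywhere; it does not connect to Hive. The table `visits` and its contents are hypothetical, and while a query of this shape should look much the same in Hive, its exact dialect may differ.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, diagnosis TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "flu"), (2, "asthma"), (3, "flu")],
)

# One declarative query replaces a hand-written mapper and reducer.
query = """
    SELECT diagnosis, COUNT(*) AS n
    FROM visits
    GROUP BY diagnosis
    ORDER BY diagnosis
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```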
The amount of data available is increasing exponentially. At the same time, the costs of saving and storing this data are decreasing. This leads many companies to save data as a precaution and check later how it can be used. As far as personal data is concerned, this of course raises data protection issues.
Final thoughts
In this article, I don’t mean to present Big Data as a groundbreaking novelty. I believe it is something that should be adopted widely, and that has already been adopted by a lot of world-famous companies.
In the course of the digitization of the health system in general, and currently with the corona crisis in particular, new questions about data protection arise. The development and use of ever more technologies, applications, and means of communication offers a lot of benefits but also carries (data protection) risks. Medical examinations via video chat, telemedicine, medical certificates issued over the internet and a large number of different health apps mean that health data does not simply remain within an institution like a hospital, but also ends up on private devices, on servers of app developers, or in other places.
First, we have to deal with the question of which data sets are actually decisive for the question that we want to answer with the help of data analysis. Without this understanding, big data is nothing more than a great fog that offers a sense of technology-based security while obscuring a clear view.