Note: This is the second post of the series Data Tagging in Medical Imaging. You can find the first post here.
In the previous post of the series, Data Tagging in Medical Imaging, we gave you an overview of the kind of processes that you must put into practice to scale your data tagging engine. In this post, we will discuss what these processes are, how they affect the data tagging process, and what to consider before finalizing them.
First of all, you need some resources to set up these processes. Make sure that the following are available:
- A team of medical professionals that can annotate the data
- Data annotation software where the medical professionals will annotate the data
- And obviously…DATA!
So, let’s dive into these processes: what they are, and what role they play in efficient data tagging. We have listed three main processes below. Ideally, they all form one large workflow designed to help you annotate your medical dataset in a faster, more scalable and more accurate way.
- Data Flow
- Quality Assurance/Quality Control
- Providing the data to the data scientists
Let’s discuss these processes in detail below:
Data Flow
It is really critical to define how the data will flow through all the stakeholders at different stages.
- Data Storage:
Because the data requirements for building a real-world AI solution are huge, you need to know where you will be storing terabytes of data. You can either store it in the cloud, using a cold storage service such as Google Coldline, Google Nearline or Amazon Glacier, or choose to store it locally. However, unless you have a strong, scalable local infrastructure with reliable power backup, we recommend that you use the cloud for all the data flow. The cloud gives you a lot of flexibility, and the storage is not that costly. You should also consider other factors when deciding on your storage facility, such as the quantity of data you are storing and how often you will retrieve it. Notably, cold storage is meant for inactive data, which makes it extremely cost-effective for storage, but data retrieval is costlier on cold storage platforms. So, we advise keeping some amount of training data locally, on a device such as a NAS (Network Attached Storage), which makes retrieving a small amount of data easier.
- Serving the Data:
Medical professionals annotate the data that is served to them. This data is served within the data annotation software developed by your software development team. All the medical professionals who are going to annotate the data should have their own login credentials, which gives you better visibility and progress tracking. How you serve the data should also depend on each professional's experience. In our data tagging engine, a senior radiologist has different privileges than a junior radiologist, based on the roles defined. Similarly, you have to ensure that all stakeholders see the information they need within the software. For example, non-personal information about the subject, such as age bracket, gender and morbidities like diabetes, gives the medical professionals additional context and allows them to annotate more accurately.
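To make the role and visibility rules above more concrete, here is a minimal sketch in Python. It is an illustration only, not how our tagging engine is built: the role names, the `annotate` and `verify` actions, and the field names (`age_bracket`, `gender`, `comorbidities`) are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    JUNIOR_RADIOLOGIST = "junior_radiologist"
    SENIOR_RADIOLOGIST = "senior_radiologist"


# Hypothetical permission sets; the real split depends on your own workflow.
PERMISSIONS = {
    Role.JUNIOR_RADIOLOGIST: {"annotate"},
    Role.SENIOR_RADIOLOGIST: {"annotate", "verify"},
}

# Hypothetical names for the non-personal fields shown next to a scan.
VISIBLE_FIELDS = ("age_bracket", "gender", "comorbidities")


@dataclass(frozen=True)
class User:
    username: str
    role: Role


def can(user: User, action: str) -> bool:
    """Return True if the user's role allows the given action."""
    return action in PERMISSIONS[user.role]


def study_context(record: dict) -> dict:
    """Keep only the non-personal fields from a study record."""
    return {key: record[key] for key in VISIBLE_FIELDS if key in record}
```

With this in place, a check such as `can(user, "verify")` decides whether a login may verify annotations, and `study_context(record)` drops anything that is not on the allow-list before the record reaches the annotation screen.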
Quality Assurance/Quality Control
This is perhaps the most important process in your workflow. The entire purpose of this exercise is to create a dataset that can be used to train the machines, and the models that you create are only as accurate as the training dataset. Here are the things that you can do to create a strong QA/QC layer in your workflow.
- Verification of annotations:
Each and every tag needs to be thoroughly verified by a medical professional with a different privilege level. This double verification ensures a lower error rate in the annotated dataset.
- Feedback to medical professionals:
It is essential to continuously monitor the performance of the medical professionals and then give them feedback. For instance, some medical professionals may be very good at annotating NCCT Head but not MRI Head. Patterns like these need to be identified and then acted upon accordingly.
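One way to spot the kind of pattern described above is to track how often each professional's annotations pass verification, split by scan type. The Python sketch below assumes that every verification is recorded as an `(annotator, modality, accepted)` tuple; that record shape and the 0.8 threshold are assumptions made for the example, not values from our engine.

```python
from collections import defaultdict


def acceptance_rates(reviews):
    """Compute the share of accepted annotations per (annotator, modality).

    `reviews` is an iterable of (annotator, modality, accepted) tuples,
    where `accepted` is True if the verifier approved the annotation.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [accepted, total]
    for annotator, modality, accepted in reviews:
        entry = counts[(annotator, modality)]
        entry[0] += int(accepted)
        entry[1] += 1
    return {key: acc / total for key, (acc, total) in counts.items()}


def flag_for_feedback(rates, threshold=0.8):
    """Return the (annotator, modality) pairs whose rate is below the threshold."""
    return sorted(key for key, rate in rates.items() if rate < threshold)
```

A pair that shows up in `flag_for_feedback` is a prompt for a conversation or for routing that scan type to someone else, not an automatic judgment, especially when only a few annotations have been verified so far.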
Providing annotated data to the Data Scientists
This is the final layer, and there is a lot of scope for miscommunication in it. The entire purpose of annotating all the medical data is to provide it to the Data Science team so that they can build the model. The Data Science team is always running a number of experiments, each of which needs a different amount of data. In between all the experiments, it is very easy for the Data Scientists to get lost in the stream of annotated data coming from the Tagging Engine. This makes maintaining data flow logs really important.
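A data flow log does not need to be elaborate. As a rough sketch, the Python snippet below keeps an in-memory record of which annotated batch went to which experiment. The class names, the `batch_id` and `experiment` fields, and the in-memory list are assumptions for illustration; in practice you would persist these entries in a database or a file.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Delivery:
    batch_id: str
    experiment: str
    num_studies: int
    delivered_at: datetime


@dataclass
class DataFlowLog:
    entries: list = field(default_factory=list)

    def record(self, batch_id: str, experiment: str, num_studies: int) -> Delivery:
        """Append one delivery of annotated data to the log."""
        entry = Delivery(batch_id, experiment, num_studies,
                         datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def batches_for(self, experiment: str) -> list:
        """List the batch ids delivered to an experiment, in delivery order."""
        return [e.batch_id for e in self.entries if e.experiment == experiment]

    def studies_for(self, experiment: str) -> int:
        """Count the annotated studies delivered to an experiment."""
        return sum(e.num_studies for e in self.entries if e.experiment == experiment)
```

With a log like this, a question such as "which batches did this experiment train on?" becomes a lookup (`log.batches_for("exp-a")`, where `exp-a` is a made-up experiment name) instead of a guess.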
In this blog, we discussed the initial requirements for setting up a scalable tagging engine and the core processes involved. Processes such as data flow, QA/QC and training of medical professionals are critical in deploying a smart tagging engine, and so is formulating them properly.
In the next blog of this series, we will discuss the most important aspect of data tagging capabilities: Quality Control. You can find the first blog of the series here. Please watch this space for more.