By JIANG Buxing, Data Scientist
As the big data concept gains momentum, unstructured data analytics technology is attracting a lot of attention. It is often said that 80% of all data in an enterprise is unstructured, which is roughly true when measured by storage space, given the large size of audio and video data. With huge amounts of data at hand, some technique is required to analyze it.
But make sure you are not misled by claims of a universal unstructured data analytics technology.
There is no universal unstructured data computing technology
Unstructured data comes in a variety of formats, such as audio, images, text, web data, office documents and device logs. Each format needs a specific processing technique, such as speech recognition, image comparison, full-text search or graphic computation. No single technique can be applied to the analysis of all formats of unstructured data. You cannot replace image comparison with speech recognition, or full-text search with graphic computation.
A software vendor that specializes in a certain technology will clearly advertise its domain, such as facial recognition or text mining, instead of just claiming to be an expert while offering nothing specific. It is obviously easier to find target customers and to market a highly specialized product. A vendor that peddles unstructured data analytics but fails to offer a specialized product is a jack of all trades, master of none.
There is a universal unstructured data storage technology
There are indeed certain technological fields where unstructured data analytics is the dominant need. But in other fields, users are mainly concerned with storing the unstructured data properly. On the whole, unstructured data analytics is not a universal demand. Though there is no all-embracing unstructured data computing and analytics technology, a universal storage and management technology (covering adding, deleting and searching data) does exist. Since unstructured data occupies much more space than structured data, it needs a different storage technique.
Unless the data size is particularly huge, or high concurrency is required for data search, most network and distributed file systems (such as HDFS) are capable of meeting the demand for data storage and access. Yet a vendor seems less technologically sophisticated if it sells no more than unstructured data storage and management services. So many software vendors strive to advertise analytics, even if they have no substantial services to offer. In contrast, a real storage service provider that offers high-capacity, high-performance data access focuses on promoting its storage infrastructure rather than data analytics solutions.
Structured data analytics is the underlying rock
The collection of unstructured data is often accompanied by the collection of structured data, such as the producer of an audio or video file, its creation time, type, duration, and so on. Sometimes unstructured data becomes structured data after processing. A web log, for instance, may be parsed into visitor IP addresses, access times, keywords, and other attributes. In that case the so-called unstructured data problem is in essence a structured one. And mature, standard technologies for structured data analytics already exist, such as relational algebra and relational databases.
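As an illustration of the web log case, here is a minimal sketch in Python. The log layout, the sample lines, the table name visits and the helper names parse_line, load and keyword_counts are all assumptions made up for this example; real logs will differ. It only shows that, once a line is split into fields, the remaining work is an ordinary relational query.

```python
import re
import sqlite3

# Hypothetical log lines; the layout (IP, bracketed time, quoted request)
# is an assumption for illustration, not a standard this article defines.
LOG_LINES = [
    '203.0.113.5 [2024-01-01T10:00:00] "GET /search?q=big+data"',
    '203.0.113.5 [2024-01-01T10:05:00] "GET /search?q=hdfs"',
    '198.51.100.7 [2024-01-01T10:06:00] "GET /search?q=big+data"',
]

LINE_RE = re.compile(r'^(\S+) \[([^\]]+)\] "GET /search\?q=([^"]+)"$')


def parse_line(line):
    """Split one raw log line into (ip, access_time, keyword), or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    ip, access_time, query = match.groups()
    return ip, access_time, query.replace("+", " ")


def load(lines):
    """Load the parsed rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits (ip TEXT, access_time TEXT, keyword TEXT)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    return conn


def keyword_counts(conn):
    """From here on it is plain structured analytics: a GROUP BY query."""
    return conn.execute(
        "SELECT keyword, COUNT(*) FROM visits "
        "GROUP BY keyword ORDER BY COUNT(*) DESC, keyword"
    ).fetchall()
```

Calling keyword_counts(load(LOG_LINES)) returns the search keywords ranked by frequency. Nothing in that query is specific to "unstructured" data; the only format-specific step is the regular expression in parse_line.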
Yet, to grab users’ attention, vendors who just go with the tide have concocted the concept of unstructured data analytics, which disguises the underlying rock: the structured data problem.
That is why users, the demand side, need to understand clearly what treatment their data requires. If the data only needs proper storage, then an open-source distributed file system, say HDFS, is sufficient. If high-performance access is needed, go to a storage vendor. If it is the derived structured data that needs analysis, that falls within the scope of familiar database processing. If the data needs specialized processing, find a vendor and a technology dedicated to that area. In a word, be exact about the type of data processing you need.