The term Data Lake has no single, fixed definition. To capture this, the dataottam team took the initiative and released an eBook called “The Collective Definition of Data Lake by Big Data Community”, which collects definitions from a range of business-savvy people and technologists.
In a nutshell, a Data Lake is a data storage and processing system in which an organization can place internal data, external data, partners’ data, competitor data, business process data, social data, and people data. A Data Lake is not Hadoop. It leverages the Store-All principle of data, and it is the data scientist’s preferred data factory.
- Scalability – The capability of a data system, network, or process to handle a growing amount of data, or its potential to be enlarged to accommodate that growth. Hadoop, which leverages HDFS storage, is one of the standard tools for horizontal scalability.
- Converge All Data Sources – Hadoop has the power to store multi-structured data from a diverse set of sources. In simple terms, the Data Lake can store logs, XML, multimedia, sensor data, binary data, social data, chat, and people data (see the raw-zone sketch after this list).
- Accommodate High-Speed Data – To bring high-speed data into the Data Lake, it should use tools such as Chukwa, Scribe, Kafka, and Flume, which can acquire and queue high-velocity streams. That streaming data can then be integrated with historical data to extract its fullest insights (a producer sketch follows this list).
- Implant the Schema – To gain insights and intelligence from the data stored in the Data Lake, we must implant a schema onto the data and make it flow into analytical systems. The Data Lake is able to leverage both structured and unstructured data this way.
- AS-IS Data Format – In legacy data systems, data is modeled into cubes at ingestion (ingress) time. The Data Lake removes the need for data modeling at ingestion; it can be done at consumption time instead. This offers unmatched flexibility to ask any business or domain question and to seek insightful, intelligent answers.
- The Schema – A traditional data warehouse does not support schema-less storage. The Data Lake, however, leverages Hadoop’s simplicity to store data in schema-less-write, schema-on-read mode, which is very useful at data consumption time (see the schema-on-read sketch after this list).
- The Favorite SQL – Once data is ingested, cleansed, and stored in the structured SQL storage of the Data Lake, we can reuse existing PL/SQL and DB2 SQL scripts. Tools such as HAWQ, Impala, Hive, and Cascading give us the flexibility to run massively parallel SQL queries while simultaneously integrating with advanced algorithm libraries such as MLlib and MADlib and applications such as SAS. Performing SQL processing inside the Data Lake decreases the time to results and consumes far fewer resources than performing it outside (a SQL sketch follows this list).
- Advanced Analytics – Unlike a data warehouse, the Data Lake excels at using large quantities of coherent data together with deep-learning algorithms to recognize items of interest and to power real-time decision analytics (a minimal MLlib sketch closes this list).
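To make a few of the bullets above concrete, here are some minimal Python sketches. First, converging raw sources: landing heterogeneous files as-is in a raw zone on HDFS. The hdfscli library (`hdfs` on PyPI), the WebHDFS endpoint, the user, and the lake paths are all illustrative assumptions, not anything prescribed here.

```python
from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint and user; adjust for your cluster.
client = InsecureClient("http://namenode:9870", user="etl")

# Land data as-is, in whatever shape the source produces:
# application logs, XML documents, and binary media, side by side.
client.write("/lake/raw/logs/2024-01-01/app.log",
             data=b"2024-01-01T00:00:01 INFO startup complete\n",
             overwrite=True)
client.write("/lake/raw/xml/orders/order-42.xml",
             data=b"<order id='42'><total>19.99</total></order>",
             overwrite=True)
client.upload("/lake/raw/media/", "promo-video.mp4")  # local file to HDFS
```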
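Second, high-speed ingestion: a sketch of queuing events through Kafka (one of the tools named above), using the kafka-python client. The broker address and topic name are hypothetical.

```python
import json
from kafka import KafkaProducer  # kafka-python package

# Hypothetical broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Queue a high-velocity sensor reading; Kafka buffers it until
# downstream consumers drain it into the lake's raw zone.
producer.send("sensor-events", {"device_id": "d-17", "temp_c": 21.4,
                                "ts": "2024-01-01T00:00:01Z"})
producer.flush()  # block until the broker acknowledges
```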
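Third, schema-less write with schema-based read: a PySpark sketch that reads raw JSON events exactly as they were landed and implants a schema only at consumption time. PySpark, the field names, and the path are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is declared here, at read time -- the files in the raw
# zone were written with no modeling at all (schema-less write).
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temp_c", DoubleType()),
    StructField("ts", TimestampType()),
])

events = spark.read.schema(schema).json("hdfs:///lake/raw/sensor-events/")
events.createOrReplaceTempView("sensor_events")  # expose to SQL
```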
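Fourth, SQL inside the lake: continuing in the same SparkSession as the previous sketch, a massively parallel aggregate over the registered view. Hive, Impala, or HAWQ would accept a near-identical statement.

```python
# Runs in the same SparkSession as the schema-on-read sketch.
daily = spark.sql("""
    SELECT device_id,
           to_date(ts)  AS day,
           avg(temp_c)  AS avg_temp_c
    FROM sensor_events
    GROUP BY device_id, to_date(ts)
""")

# Persist the curated result back into the lake for reuse.
daily.write.mode("overwrite").parquet("hdfs:///lake/curated/daily_temps/")
```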
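Finally, advanced analytics: a sketch that feeds the curated output of the SQL step into MLlib, mentioned in the SQL bullet above. The k-means choice and the single feature column are illustrative assumptions, not a prescribed pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("lake-analytics").getOrCreate()

daily = spark.read.parquet("hdfs:///lake/curated/daily_temps/")

# Assemble numeric columns into the vector column MLlib expects.
features = VectorAssembler(inputCols=["avg_temp_c"],
                           outputCol="features").transform(daily)

# Cluster devices by temperature profile; 'prediction' holds the cluster id.
model = KMeans(k=3, seed=42).fit(features)
scored = model.transform(features)
scored.select("device_id", "day", "prediction").show()
```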