
Data Lake explained!

At this moment I do not have a personal relationship with a computer.

Janet Reno

A Data Lake is a great solution for organizations that need to store raw data, both structured and unstructured, from source systems in the cloud.

Because this storage layer handles both structured and unstructured data, a schema is not mandatory when you store data in a Data Lake: the data does not need to fit a predefined schema before storage. Only when the data is read and processed is it parsed and adapted into a schema (often called schema-on-read). This saves the time that would otherwise be spent on defining a schema up front.
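To make the schema-on-read idea concrete, here is a minimal sketch in Python. It assumes a local temporary folder standing in for lake storage, and the record fields, `SCHEMA`, `write_raw` and `read_with_schema` are hypothetical names made up for this illustration, not part of any Data Lake product.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw records from a source system; they do not share one shape.
RAW_RECORDS = [
    {"order_id": "1001", "amount": "19.99", "country": "DE"},
    {"order_id": "1002", "amount": "5.00"},                  # no country
    {"order_id": "1003", "amount": "7.50", "note": "gift"},  # extra field
]

# Assumption for this sketch: the schema lives with the reader, not the storage.
SCHEMA = {"order_id": int, "amount": float, "country": str}


def write_raw(path: Path, records) -> None:
    """Store records as-is (one JSON object per line); nothing is validated."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_with_schema(path: Path, schema) -> list:
    """Parse each raw line and cast it to the schema only at read time."""
    rows = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            raw = json.loads(line)
            row = {}
            for name, cast in schema.items():
                value = raw.get(name)
                # Missing fields become None; fields outside the schema are ignored.
                row[name] = cast(value) if value is not None else None
            rows.append(row)
    return rows


def demo() -> list:
    # A temporary local folder stands in for the lake in this sketch.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "orders.jsonl"
        write_raw(path, RAW_RECORDS)
        return read_with_schema(path, SCHEMA)
```

Calling `demo()` stores all three records even though their shapes differ, and the schema is applied only when the file is read back. A different reader could apply a different schema to the same raw file.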

There is also no quality enforcement when data is loaded. This is a double-edged sword: a Data Lake can store many types of data, but the lack of quality enforcement can lead to inconsistencies in the data.

A Data Lake on its own offers no consistency or isolation guarantees. This means it is not safe to read or append data while an update is in progress, because a reader may see a partially applied change.
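The missing isolation can be illustrated with a small Python sketch. It assumes a dataset made of two files in a local temporary folder (a stand-in for lake storage) and simulates a reader that runs between the two steps of an update. The file names and the `read_dataset` helper are hypothetical.

```python
import tempfile
from pathlib import Path


def read_dataset(folder: Path) -> list:
    """Read every CSV file in the folder as one logical dataset."""
    rows = []
    for part in sorted(folder.glob("*.csv")):
        rows.extend(part.read_text(encoding="utf-8").splitlines())
    return rows


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp) / "prices"
        folder.mkdir()

        # Initial state: both files hold version 1 of their rows.
        (folder / "part-0.csv").write_text("a,v1\n", encoding="utf-8")
        (folder / "part-1.csv").write_text("b,v1\n", encoding="utf-8")

        # The update rewrites the dataset file by file; nothing groups
        # the two writes into a single transaction.
        (folder / "part-0.csv").write_text("a,v2\n", encoding="utf-8")
        during_update = read_dataset(folder)  # a reader arriving mid-update

        (folder / "part-1.csv").write_text("b,v2\n", encoding="utf-8")
        after_update = read_dataset(folder)

    return during_update, after_update
```

In this sketch the reader that arrives mid-update gets a mix of old and new rows, a state that never existed logically. Nothing in the plain folder structure prevents that.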

Folder partitioning on the Data Lake

(Image from Microsoft Tech community)

Partitioned data is stored in a folder structure that lets us find specific data entries faster, because a query can skip irrelevant folders through partition pruning (also called partition elimination).

Storing data in a partitioned folder structure helps improve data manageability and query performance. For example, if sales data is partitioned by date (year, month, day), it is split into folders based on the date value.
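The sales example can be sketched in a few lines of Python. This is only an illustration under assumptions: the `year=.../month=.../day=...` folder naming is one common convention rather than a requirement, a local temporary folder stands in for the lake, and `partition_dir`, `write_sales` and `files_for_month` are hypothetical helper names.

```python
import tempfile
from datetime import date
from pathlib import Path


def partition_dir(root: Path, day: date) -> Path:
    """Build the folder for one day, e.g. sales/year=2024/month=01/day=15."""
    return (
        root
        / f"year={day.year:04d}"
        / f"month={day.month:02d}"
        / f"day={day.day:02d}"
    )


def write_sales(root: Path, day: date, lines) -> Path:
    """Write one day's sales rows into that day's partition folder."""
    folder = partition_dir(root, day)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "part-0000.csv"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def files_for_month(root: Path, year: int, month: int) -> list:
    """Partition pruning: only the matching month folder is ever listed."""
    month_dir = root / f"year={year:04d}" / f"month={month:02d}"
    return sorted(month_dir.glob("day=*/*.csv"))


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "sales"
        write_sales(root, date(2024, 1, 15), ["1001,19.99"])
        write_sales(root, date(2024, 1, 16), ["1002,5.00"])
        write_sales(root, date(2024, 2, 1), ["1003,7.50"])
        # Querying January never touches the February folder.
        found = files_for_month(root, 2024, 1)
        return [p.relative_to(root).as_posix() for p in found]
```

Here `demo()` returns only the two January files. The February folder is never opened, which is the effect partition pruning has on a real query engine.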

Published in Data Warehouse, Personal Posts, Technical Posts