Snowflake: Hybrid columnar storage – Suraraj's Jumping Pad

What is Hybrid columnar storage?

Hybrid columnar storage is a database storage layout that combines the benefits of columnar storage and row-based storage to improve query performance and storage efficiency. Rows are first divided into groups, and within each group the values are stored column by column. This approach is used in modern data warehouses such as Snowflake. The rest of this post explains the idea and walks through an example from Snowflake.

Columnar Storage:
- Columnar storage stores data by column instead of by row: all values of a single column are stored together.
- Because the values in one column share a type and are often similar, they compress well. Analytical queries also run faster, since a query reads only the columns it needs.

Row-Based Storage:
- In row-based storage, all the data for a single row is stored together, so each stored row contains every column of the table.
- This is efficient for transactional workloads, where entire rows are read or written at once, but less so for analytical queries that touch only a few columns across many rows.
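
The difference between the two layouts can be illustrated with a small Python sketch. The sample rows and the helper names `to_columns` and `run_length_encode` are made up for illustration, and run-length encoding stands in here as one generic example of a columnar compression technique; nothing in this sketch is meant to describe how any particular database stores data internally.

```python
from itertools import groupby

# Hypothetical sample data in row-based form: one tuple per row.
column_names = ("Transaction_ID", "Transaction_Date", "Quantity")
rows = [
    (1001, "2023-10-01", 5),
    (1002, "2023-10-01", 3),
    (1003, "2023-10-01", 2),
    (1004, "2023-10-02", 4),
]


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, column_names)

# An aggregate over one column touches only that column's values.
total_quantity = sum(columns["Quantity"])

# Similar values sit next to each other in a column, so they compress well.
encoded_dates = run_length_encode(columns["Transaction_Date"])
```

In the row-based form, computing the same total would mean walking every row and skipping over the columns that are not needed.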

Now, let’s look at how Snowflake implements Hybrid Columnar Storage:

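Snowflake is a cloud-based data warehousing platform. Its overall design is known as the “multi-cluster, shared data” architecture, which separates compute from storage; the hybrid columnar layout belongs to the storage layer of that architecture. Snowflake divides each table into micro-partitions, which are contiguous units of storage that each hold a group of rows (between 50 MB and 500 MB of uncompressed data, according to Snowflake's documentation). Within a micro-partition, the values of each column are stored together. Grouping rows first and then storing columns within each group is what makes the layout hybrid.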
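
A hybrid layout of this general kind, where rows are first split into groups and each group is then stored column by column, can be sketched as follows. This is a minimal illustration and not Snowflake's actual file format: the function name `build_row_groups`, the `group_size` parameter, and the per-column min/max statistics are all assumptions made for the example.

```python
def build_row_groups(rows, names, group_size):
    """Split rows into fixed-size groups and store each group by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Inside a group, values are laid out one column at a time.
        columns = {name: [row[i] for row in chunk] for i, name in enumerate(names)}
        # Illustrative metadata: smallest and largest value of each column.
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


names = ("Transaction_ID", "Transaction_Date", "TotalAmount")
rows = [
    (1001, "2023-10-01", 500.00),
    (1002, "2023-10-02", 300.00),
    (1003, "2023-10-03", 200.00),
]

# A tiny group size so that three rows produce two groups.
groups = build_row_groups(rows, names, group_size=2)
```

Every row ends up in exactly one group, and every group holds all columns for its rows, which is the property that distinguishes this layout from a purely columnar one.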

Example from Snowflake:

Suppose you have a Snowflake database with a table called “Sales,” which contains data about sales transactions. This table might have columns like “Transaction_ID,” “Customer_ID,” “Product_ID,” “Transaction_Date,” “Quantity,” and “TotalAmount.”

Viewed row by row, the table looks like this:

	Transaction_ID	Customer_ID	Product_ID	Transaction_Date	Quantity	TotalAmount
Row 1	1001	101	201	2023-10-01	5	500.00
Row 2	1002	102	202	2023-10-02	3	300.00
Row 3	1003	103	203	2023-10-03	2	200.00

Stored by column, the same data looks like this:

Transaction_ID:   [1001, 1002, 1003]
Customer_ID:      [101, 102, 103]
Product_ID:       [201, 202, 203]
Transaction_Date: [2023-10-01, 2023-10-02, 2023-10-03]
Quantity:         [5, 3, 2]
TotalAmount:      [500.00, 300.00, 200.00]

- Within each group of rows (a micro-partition, in Snowflake's terms), the data is stored by column. This is the columnar part of hybrid columnar storage: the values for each column are stored together.
- When you run a query, Snowflake reads only the columns relevant to it. For example, to find the total sales revenue for a specific date range, Snowflake needs only the “Transaction_Date” and “TotalAmount” columns and can leave the other columns unread.
- Snowflake also compresses each column individually, which further reduces the storage space required.
- Because all the columns of a given row live in the same micro-partition, retrieving an entire row means reading from a single micro-partition rather than collecting values scattered across the whole table. This is the row-oriented part of the hybrid design; standard Snowflake tables do not keep a separate row-based copy of the data.
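
The read patterns described above can be sketched in plain Python. The `row_groups` structure, the `date_range` metadata, and the functions `total_revenue` and `fetch_row` are hypothetical names used only for this illustration; skipping groups by their stored date range is an assumption of the sketch, and the code is not a description of Snowflake's query engine.

```python
# Hypothetical row groups: each holds a few rows stored column by column,
# plus the smallest and largest Transaction_Date found in that group.
row_groups = [
    {
        "date_range": ("2023-10-01", "2023-10-02"),
        "columns": {
            "Transaction_ID": [1001, 1002],
            "Customer_ID": [101, 102],
            "Transaction_Date": ["2023-10-01", "2023-10-02"],
            "TotalAmount": [500.00, 300.00],
        },
    },
    {
        "date_range": ("2023-10-03", "2023-10-03"),
        "columns": {
            "Transaction_ID": [1003],
            "Customer_ID": [103],
            "Transaction_Date": ["2023-10-03"],
            "TotalAmount": [200.00],
        },
    },
]


def total_revenue(groups, start, end):
    """Sum TotalAmount for dates in [start, end], reading only two columns."""
    total = 0.0
    groups_read = 0
    for group in groups:
        low, high = group["date_range"]
        if high < start or low > end:
            continue  # no row in this group can match, so skip it entirely
        groups_read += 1
        dates = group["columns"]["Transaction_Date"]
        amounts = group["columns"]["TotalAmount"]
        total += sum(a for d, a in zip(dates, amounts) if start <= d <= end)
    return total, groups_read


def fetch_row(groups, transaction_id):
    """Rebuild one complete row; all of its columns sit in a single group."""
    for group in groups:
        ids = group["columns"]["Transaction_ID"]
        if transaction_id in ids:
            i = ids.index(transaction_id)
            return {name: values[i] for name, values in group["columns"].items()}
    return None


revenue, groups_read = total_revenue(row_groups, "2023-10-01", "2023-10-02")
row = fetch_row(row_groups, 1003)
```

The dates are ISO-formatted strings, so ordinary string comparison orders them correctly. The revenue query never looks at the “Customer_ID” column, and the row lookup assembles its result from one group only.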

In summary, Snowflake’s hybrid columnar storage groups rows into micro-partitions and stores each column separately within them. This gives analytical queries the scan efficiency and compression of columnar storage, while keeping all the columns of a row close together so that complete rows remain reasonably cheap to retrieve. The layout is one of the reasons Snowflake performs and scales well across a wide range of data warehouse workloads.