Column-Oriented Storage

Basic idea:

  • Although fact tables are often over 100 columns wide, a typical data warehouse query accesses only 4 or 5 of them at a time (SELECT * queries are rarely needed for analytics)

  • The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work
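
A minimal sketch of this idea (the fact-table data, column names, and the .col file format are all made up for illustration, not any real storage engine's layout):

```python
# Sketch: write each column to its own file, in the same row order,
# then answer a query by reading only the columns it touches.
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 1, "net_price": 13.99},
    {"date_key": 140102, "product_sk": 30, "quantity": 3, "net_price": 0.99},
    {"date_key": 140103, "product_sk": 68, "quantity": 5, "net_price": 4.49},
]

# Column-oriented layout: one file per column (vs. one file of whole rows).
for col in ["date_key", "product_sk", "quantity", "net_price"]:
    with open(f"{col}.col", "w") as f:
        f.write("\n".join(str(r[col]) for r in rows))

# SELECT SUM(quantity) reads one of the four files, not all of them.
with open("quantity.col") as f:
    print(sum(int(line) for line in f))  # 9
```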

Column Compression

Often, the number of distinct values in a column is small compared to the number of rows (for example, a retailer may have billions of sales transactions, but only 100,000 distinct products). We can now take a column with n distinct values and turn it into n separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.

  • WHERE product_sk IN (30, 68, 69):​ Load the three bitmaps for product_sk = 30, product_sk = 68, and product_sk = 69, and calculate the bitwise OR of the three bitmaps, which can be done very efficiently.
  • WHERE product_sk = 31 AND store_sk = 3:​ Load the bitmaps for product_sk = 31 and store_sk = 3, and calculate the bitwise AND. This works because the columns contain the rows in the same order, so the kth bit in one column’s bitmap corresponds to the same row as the kth bit in another column’s bitmap.
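
A minimal sketch of both predicates above, using plain Python integers as bitmaps (the column contents are invented for illustration):

```python
# One bitmap per distinct value; bit k of a bitmap corresponds to row k.
product_sk = [69, 69, 69, 74, 31, 31, 31, 29, 30, 30, 31, 68, 69]
store_sk   = [ 4,  5,  5,  3,  2,  1,  2,  5,  1,  1,  3,  1,  3]

def bitmaps(column):
    out = {}
    for row, value in enumerate(column):
        out[value] = out.get(value, 0) | (1 << row)  # set bit `row`
    return out

product_bm, store_bm = bitmaps(product_sk), bitmaps(store_sk)

# WHERE product_sk IN (30, 68, 69): bitwise OR of three bitmaps.
in_mask = product_bm[30] | product_bm[68] | product_bm[69]

# WHERE product_sk = 31 AND store_sk = 3: bitwise AND of two bitmaps.
and_mask = product_bm.get(31, 0) & store_bm.get(3, 0)

# Turn a bitmap back into matching row numbers.
print([k for k in range(len(product_sk)) if in_mask >> k & 1])
print([k for k in range(len(store_sk)) if and_mask >> k & 1])  # [10]
```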

Writing to Column-Oriented Storage

Column-oriented storage, compression, and sorting all help to make those read queries faster. However, they have the downside of making writes more difficult.

Fortunately, we have already seen a good solution earlier in this chapter: LSM-trees. All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk. When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk.

Aggregation: Data Cubes and Materialized Views

If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that queries use most often?
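
A toy version of that caching idea: materialize a small two-dimensional "cube" of sums keyed by (date_key, product_sk), then answer repeated queries from the cube instead of re-scanning the raw facts (the dataset is made up; prices are in cents to keep the sums exact):

```python
from collections import defaultdict

# Raw fact rows: (date_key, product_sk, net_price in cents).
facts = [
    (140102, 69, 1399), (140102, 69, 1399), (140102, 30, 99),
    (140103, 30, 297),  (140103, 31, 349),
]

# Materialize SUM(net_price) GROUP BY date_key, product_sk once, up front.
cube = defaultdict(int)
for date_key, product_sk, net_price in facts:
    cube[(date_key, product_sk)] += net_price

# "Total sales per date" now collapses one dimension of the small cube
# instead of re-scanning the raw fact table.
per_date = defaultdict(int)
for (date_key, _), subtotal in cube.items():
    per_date[date_key] += subtotal
print(dict(per_date))  # {140102: 2897, 140103: 646}
```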

LSM-Tree

A short introduction this time; the data structure may well be worth a deeper study. The following content is from DeepSeek:

An Introduction to the LSM-Tree (Log-Structured Merge-Tree)

The LSM-Tree (Log-Structured Merge-Tree) is a storage data structure built for high write throughput, widely used in modern databases (such as LevelDB, RocksDB, and Cassandra) and storage engines. Its core design goal is to optimize write performance via sequential writes while balancing read and write efficiency through tiered merging (compaction).


Core Design Ideas

  1. Write optimization

    • In-memory buffering (MemTable): all writes first go into an ordered in-memory structure (the MemTable), avoiding random writes straight to disk.
    • Sequential disk writes (SSTable): when the MemTable fills up, it is frozen into an immutable SSTable (Sorted String Table) and written to disk sequentially, cutting disk-seek overhead.
  2. Tiered storage and merging (compaction)

    • Tiered structure: SSTables on disk are organized into levels (e.g. Level 0 to Level N); lower levels hold newer data, higher levels hold older data that grows larger after merging.
    • Merge process: background compaction periodically merges lower-level SSTables into higher levels, eliminating duplicate data (such as deletion markers) and reducing read amplification (the core merge step is sketched below).
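
A sketch of that core merge step, assuming each SSTable is simply a sorted list of (key, value) pairs, runs are passed newest-first, and TOMBSTONE is a made-up deletion marker:

```python
import heapq

TOMBSTONE = None  # stand-in deletion marker

def compact(runs):
    """Merge sorted runs (newest first) into one sorted run, keeping
    only the newest value per key and dropping tombstoned keys."""
    # Tag each entry with its run's age so equal keys sort newest-first.
    tagged = [[(key, age, value) for key, value in run]
              for age, run in enumerate(runs)]
    out, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                      # older duplicate: discard
        last_key = key
        if value is not TOMBSTONE:
            out.append((key, value))      # tombstones are dropped entirely
    return out

newer = [("apple", 3), ("cherry", TOMBSTONE)]
older = [("apple", 1), ("banana", 2), ("cherry", 9)]
print(compact([newer, older]))            # [('apple', 3), ('banana', 2)]
```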

Key Components

  1. MemTable

    • An ordered in-memory structure (such as a skip list or a balanced tree) supporting fast inserts and lookups.
    • Once writes reach a size threshold, it is frozen and flushed to disk as an SSTable.
  2. SSTable (Sorted String Table)

    • An immutable key-value file sorted by key.
    • Contains indexes (such as a sparse index) and a Bloom filter to speed up point lookups.
  3. Compaction strategies

    • Size-tiered compaction: merges SSTables of similar size within the same tier.
    • Leveled compaction: each level keeps only a single sorted run; merging works across levels and reduces space amplification.

Workflow

  1. Write path

    • Write to the MemTable → when the MemTable is full, turn it into an SSTable → write it to Level 0 on disk.
  2. Read path

    • Check the MemTable first → on a miss, search the SSTables level by level (Level 0 through N).
    • Use Bloom filters to rule out absent keys quickly, reducing disk accesses.
  3. Compaction flow

    • Background threads pick SSTables within a level to merge, produce new SSTables at the higher level, and delete the old files (a toy version of the whole write/read path follows).
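
A toy end-to-end version of this workflow, with a dict standing in for the MemTable and in-memory sorted lists standing in for on-disk SSTables (compaction and Bloom filters are omitted; every name here is illustrative):

```python
import bisect

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}        # in-memory buffer: key -> value
        self.sstables = []        # newest first; each a sorted [(key, value)]
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Freeze the MemTable into a sorted, immutable run ("Level 0").
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # 1. check the MemTable
            return self.memtable[key]
        for run in self.sstables:            # 2. then each run, newest first
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None                          # key not present anywhere

db = ToyLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 10)]:
    db.put(k, v)
print(db.get("a"), db.get("b"))              # 10 2  (newest value wins)
```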

Pros and Cons

Pros:

  ✅ High write throughput (sequential-write optimized)
  ✅ Naturally supports data compression and deduplication
  ✅ SSD-friendly (reduces random-write wear)

Cons:

  ❌ Read latency can be higher (several levels may need to be searched)
  ❌ Compaction can consume CPU and I/O resources
  ❌ Range-query efficiency depends on the level-merge strategy

Use Cases

  1. Write-heavy systems: log storage, time-series databases (such as InfluxDB), IoT data ingestion.
  2. Distributed databases: the underlying storage engines of Cassandra and HBase.
  3. Message queues: Kafka's log storage (built on a similar log-structured idea).

Comparison with B-Trees

  • Write performance: high for LSM-Trees (sequential writes); lower for B-Trees (random writes cause page splits)
  • Read performance: LSM-Tree point lookups rely on Bloom filters and range scans are slower; B-Trees handle both point and range queries efficiently
  • Storage overhead: LSM-Trees may suffer write amplification (from compaction); B-Trees use space efficiently but must maintain page structures
  • Typical fit: LSM-Trees suit write-heavy, read-light workloads and SSDs; B-Trees suit read-heavy, write-light workloads and traditional hard disks

Optimization Techniques

  1. Bloom filters: speed up point lookups by skipping disk accesses for absent keys (a sketch follows this list).
  2. Tiered indexing: build sparse indexes over SSTables to locate data blocks quickly.
  3. Parallel compaction: merge on multiple threads to reduce the impact on foreground work.
  4. Tiered vs. leveled compaction: pick the merge strategy that matches the workload (write-optimized vs. read-optimized).
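
For the first of these, a minimal Bloom filter sketch (the bit-array size, hash count, and salting scheme are chosen arbitrarily): it answers "definitely absent" or "possibly present" before any disk access.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k, self.bits = m_bits, k_hashes, 0

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False: definitely absent, so the disk read can be skipped.
        # True: possibly present (false positives occur, false negatives don't).
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("product_sk:31")
print(bf.might_contain("product_sk:31"))     # True
print(bf.might_contain("product_sk:99"))     # False (almost certainly)
```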

Summary

By combining sequential writes with tiered merging, the LSM-Tree excels at high-throughput write workloads and is particularly well suited to modern SSD storage and distributed systems. Although its reads can be slower than a B-Tree's, Bloom filters and a sensible tiering strategy keep it fast enough for most real-time needs. Understanding its core principles and trade-offs is key to designing high-performance storage systems.