Requirements
- Billions of rows as an input
- Sub-second response time
- High query concurrency
- Input data always has a unique primary key
- Sources are both batch and real-time streaming
- Joins between data sources
Cube Store Architecture
Design decisions
- High-throughput Meta Store
- Indexes are sorted copies
- Auto partitioning
- Parquet as a format
- Distributed FS as a storage layer
- Shared-nothing architecture
- Collocated join indexes
- Real-time in-memory chunks
- Tools: Rust, Arrow, DataFusion, RocksDB
Design decisions matrix

Requirements: billions of rows as an input · sub-second response time · high query concurrency · unique primary keys · batch and streaming sources · joins between data sources

High-throughput Meta Store         ✅ ✅
Indexes are sorted copies          ✅ ✅ ✅ ✅
Auto partitioning                  ✅ ✅ ✅ ✅
Parquet as a format                ✅
Distributed FS as a storage layer  ✅ ✅
Shared-nothing architecture        ✅
Collocated join indexes            ✅ ✅ ✅ ✅
Real-time in-memory chunks         ✅ ✅ ✅
High Throughput Meta Store
High Throughput Meta Store
- The most frequently accessed data in Cube Store
- Small relative to the stored data, but can still reach 10M+ rows
- Strong read-after-write consistency
- Heavily read-oriented, but with significant write concurrency
- Should provide an RDBMS-like experience
- RocksDB is a great tool for this problem
Meta Store Tables
- Schema
- Table
- Index
- Partition
- Chunk
- Job
- Source
Meta Store Access Flow
Meta Store key-value structure
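The key-value layout can be sketched with an ordered in-memory map standing in for RocksDB, whose keys are byte-ordered. The entity prefix and big-endian row-id encoding below are illustrative assumptions, not Cube Store's actual key format:

```rust
use std::collections::BTreeMap;

// Sketch of a meta store laid out over an ordered key-value store.
// RocksDB keys are byte-ordered, so BTreeMap<Vec<u8>, Vec<u8>> is a
// reasonable stand-in. The key layout (one-byte entity prefix followed
// by a big-endian row id) is illustrative only.

fn row_key(entity: u8, id: u64) -> Vec<u8> {
    let mut k = vec![entity];
    k.extend_from_slice(&id.to_be_bytes()); // big-endian preserves numeric order
    k
}

fn main() {
    let mut kv: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();

    const TABLES: u8 = 1; // hypothetical prefix for the Table entity

    // Primary rows: (prefix, id) -> serialized row.
    kv.insert(row_key(TABLES, 1), b"orders".to_vec());
    kv.insert(row_key(TABLES, 2), b"customers".to_vec());

    // A prefix range scan enumerates all rows of one entity in id order
    // (the exclusive upper bound skips id u64::MAX; fine for a sketch).
    let names: Vec<String> = kv
        .range(row_key(TABLES, 0)..row_key(TABLES, u64::MAX))
        .map(|(_, v)| String::from_utf8(v.clone()).unwrap())
        .collect();
    println!("{:?}", names); // ["orders", "customers"]
}
```

Because keys sort by entity prefix first, a scan over one entity never touches another entity's rows, which is what makes the "tables" in the previous slide cheap to enumerate.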
Indexes are sorted copies
Indexes are sorted copies
- Sorting is the most efficient way to compress data in columnar data structures
- Most optimal for filtering
- An index per query is the way to go
- Paying write amplification for read speed is acceptable
- Every algorithm can be merge-based: merge sort, merge join, merge aggregate
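To illustrate the "everything can be merge-based" point, here is a minimal merge join over two inputs already sorted on the join key, which is exactly the property sorted-copy indexes guarantee. It is simplified to unique keys per side; real operators also handle duplicates:

```rust
// Merge join over two slices sorted by the join key: one forward pass
// over each side, no hash table. Simplified to unique keys per side.

fn merge_join<L: Clone, R: Clone>(
    left: &[(u64, L)],
    right: &[(u64, R)],
) -> Vec<(u64, L, R)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        let (lk, rk) = (left[i].0, right[j].0);
        if lk < rk {
            i += 1; // left key too small, advance left
        } else if lk > rk {
            j += 1; // right key too small, advance right
        } else {
            out.push((lk, left[i].1.clone(), right[j].1.clone()));
            i += 1;
            j += 1;
        }
    }
    out
}

fn main() {
    let users = [(1u64, "alice"), (2, "bob"), (4, "carol")];
    let orders = [(2u64, 99.0), (3, 10.0), (4, 25.0)];
    println!("{:?}", merge_join(&users, &orders));
    // [(2, "bob", 99.0), (4, "carol", 25.0)]
}
```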
Indexes are used based on filters
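One way to select an index from a query's filters is to pick the sorted copy whose leading sort columns cover the most filter columns, so the scan can seek into a narrow sorted range. The scoring below (longest matching column prefix) is a simplified assumption, not Cube Store's actual planner logic:

```rust
// Sketch of index selection by filter: score each index by how many of
// its leading sort columns appear in the query's filter set, and pick
// the best match. Prefix match matters because a sorted file can only
// be range-scanned on a prefix of its sort key.

fn matching_prefix_len(index_columns: &[&str], filter_columns: &[&str]) -> usize {
    index_columns
        .iter()
        .take_while(|c| filter_columns.contains(*c))
        .count()
}

fn pick_index<'a>(indexes: &'a [(&str, Vec<&str>)], filters: &[&str]) -> &'a str {
    indexes
        .iter()
        .max_by_key(|(_, cols)| matching_prefix_len(cols, filters))
        .map(|(name, _)| *name)
        .unwrap()
}

fn main() {
    let indexes = [
        ("by_created_at", vec!["created_at", "id"]),
        ("by_customer", vec!["customer_id", "created_at", "id"]),
    ];
    // WHERE customer_id = ? AND created_at = ? matches by_customer best.
    println!("{}", pick_index(&indexes, &["customer_id", "created_at"]));
    // by_customer
}
```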
Data compression by sorting
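The compression effect is easy to demonstrate with run-length encoding, one of the encodings columnar formats apply: sorting maximizes run lengths, so the same column collapses into far fewer runs:

```rust
// Why sorting compresses columnar data: run-length encoding collapses
// adjacent repeats, and sorting puts all repeats adjacent to each other.

fn run_length_encode(col: &[&str]) -> Vec<(String, usize)> {
    let mut runs: Vec<(String, usize)> = Vec::new();
    for v in col {
        match runs.last_mut() {
            Some((last, n)) if last.as_str() == *v => *n += 1,
            _ => runs.push((v.to_string(), 1)),
        }
    }
    runs
}

fn main() {
    let unsorted = ["US", "DE", "US", "FR", "DE", "US", "FR", "US"];
    let mut sorted = unsorted;
    sorted.sort();

    // Unsorted: nearly one run per row; sorted: one run per distinct value.
    println!("unsorted runs: {}", run_length_encode(&unsorted).len()); // 8
    println!("sorted runs:   {}", run_length_encode(&sorted).len());   // 3
}
```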
Auto partitioning
Auto partitioning
- Keeps all partitions roughly the same size
- Easy to implement, since everything is sorted
Partition split at compaction
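Because partitions are sorted, a split is just cutting the run in the middle: each half stays sorted and covers a contiguous key range. A minimal sketch (the 4-row threshold below is illustrative, not Cube Store's actual setting):

```rust
// Sketch of splitting an oversized sorted partition at compaction time.
// Each half of a sorted run is itself sorted and covers a contiguous
// key range, so splitting needs no re-sorting.

fn split_if_needed(partition: Vec<u64>, max_rows: usize) -> Vec<Vec<u64>> {
    if partition.len() <= max_rows {
        return vec![partition];
    }
    let mid = partition.len() / 2;
    let (lo, hi) = partition.split_at(mid);
    // Recurse so a very large partition splits into several bounded ones.
    let mut out = split_if_needed(lo.to_vec(), max_rows);
    out.extend(split_if_needed(hi.to_vec(), max_rows));
    out
}

fn main() {
    let partition: Vec<u64> = (1..=10).collect(); // sorted primary keys
    for p in split_if_needed(partition, 4) {
        println!("partition covering {}..={}: {:?}", p[0], p[p.len() - 1], p);
    }
}
```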
Parquet as a format
Parquet as a format
- The de facto standard for storing columnar data
- Implements Partition Attributes Across (PAX) [1]
- Allows leveraging min-max statistics for filter push-down optimizations

[1] https://www.pdl.cmu.edu/PDL-FTP/Database/pax.pdf
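Min-max pruning can be sketched in a few lines: Parquet keeps per-row-group min/max statistics, and any partition whose range cannot contain the predicate value is skipped without being read. The struct and threshold values below are illustrative:

```rust
// Sketch of filter push-down using per-partition min-max statistics,
// the kind of metadata Parquet stores per row group. Partitions whose
// [min, max] range cannot contain the predicate value are never read.

struct PartitionStats {
    name: &'static str,
    min: i64,
    max: i64,
}

fn partitions_to_scan(stats: &[PartitionStats], value: i64) -> Vec<&'static str> {
    stats
        .iter()
        .filter(|p| p.min <= value && value <= p.max) // prune by range
        .map(|p| p.name)
        .collect()
}

fn main() {
    let stats = [
        PartitionStats { name: "p0", min: 0, max: 99 },
        PartitionStats { name: "p1", min: 100, max: 199 },
        PartitionStats { name: "p2", min: 200, max: 299 },
    ];
    // WHERE id = 150 only needs p1; p0 and p2 are pruned from the scan.
    println!("{:?}", partitions_to_scan(&stats, 150)); // ["p1"]
}
```

Pruning works especially well here because data is sorted and auto-partitioned, so partition key ranges rarely overlap.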
Parquet file structure
Distributed FS as a storage layer
Distributed FS as a storage layer
- Storage-compute separation
- Compute nodes can be disposable and interchangeable
- Easy to build replication on top
- Warm up everything a query may access just before the table comes online, to avoid download wait time
Partition Warmup
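The warmup sequence can be sketched as: download every partition file a table needs into the local cache, and only then flip the table online. The `Worker` struct and `fetch` below are hypothetical stand-ins for the real remote-FS download path:

```rust
use std::collections::HashSet;

// Sketch of partition warmup: all downloads from the distributed FS
// complete before the table is published, so no query ever waits on a
// cold download. `fetch` stands in for an S3/GCS/HDFS-style transfer.

struct Worker {
    local_cache: HashSet<String>,
    online_tables: HashSet<String>,
}

impl Worker {
    fn fetch(&mut self, file: &str) {
        // In reality: copy the remote file to local disk.
        self.local_cache.insert(file.to_string());
    }

    fn warm_up_and_publish(&mut self, table: &str, partitions: &[&str]) {
        for p in partitions {
            self.fetch(p); // every download happens first...
        }
        self.online_tables.insert(table.to_string()); // ...then go online
    }
}

fn main() {
    let mut w = Worker {
        local_cache: HashSet::new(),
        online_tables: HashSet::new(),
    };
    w.warm_up_and_publish("orders", &["orders-p0.parquet", "orders-p1.parquet"]);
    println!("table online with {} cached partitions", w.local_cache.len());
}
```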
Real-time in-memory chunks
Real-time in-memory chunks
- Act as a buffer to avoid single-row writes to Parquet
- Streaming rows are sent to the partition's owner worker
- In-memory chunks are compacted into a persisted Parquet partition at one-minute intervals
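The buffering behavior can be sketched as follows. The real trigger is time-based (roughly one minute per the slide); a row-count threshold is used here only to keep the example deterministic:

```rust
// Sketch of an in-memory chunk: streamed rows are appended cheaply and
// merged into the persisted (sorted) partition in one batch, instead of
// rewriting a Parquet file per row. Row-count trigger is an assumption
// made for determinism; the real trigger is a time interval.

struct Partition {
    persisted: Vec<u64>, // stand-in for sorted Parquet data
    chunk: Vec<u64>,     // in-memory buffer for streamed rows
    chunk_limit: usize,
}

impl Partition {
    fn ingest(&mut self, row: u64) {
        self.chunk.push(row); // cheap append, no per-row Parquet write
        if self.chunk.len() >= self.chunk_limit {
            self.compact();
        }
    }

    fn compact(&mut self) {
        // Merge the whole buffer into the sorted persisted data at once.
        self.persisted.append(&mut self.chunk);
        self.persisted.sort();
    }
}

fn main() {
    let mut p = Partition { persisted: vec![], chunk: vec![], chunk_limit: 3 };
    for row in [5, 1, 9, 2] {
        p.ingest(row);
    }
    println!("persisted = {:?}, buffered = {:?}", p.persisted, p.chunk);
    // persisted = [1, 5, 9], buffered = [2]
}
```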
Routing for in-memory chunks
Shared-nothing architecture
Shared-nothing architecture
- Worker nodes never communicate with each other
- The router node receives only aggregated results and just needs to merge them
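The router-side merge is cheap because workers ship only small partial aggregates. A sketch with SUM as the aggregate (merging sums is just adding partials per group key):

```rust
use std::collections::BTreeMap;

// Sketch of the shared-nothing query path: each worker aggregates only
// its own partitions; the router merges the small partial results.
// Workers never exchange data with each other.

// A worker's partial aggregate: group key -> partial SUM.
fn worker_aggregate(rows: &[(&'static str, u64)]) -> BTreeMap<&'static str, u64> {
    let mut acc = BTreeMap::new();
    for (key, value) in rows {
        *acc.entry(*key).or_insert(0) += value;
    }
    acc
}

// Router-side merge: SUM partials combine by addition per group key.
fn merge(partials: Vec<BTreeMap<&'static str, u64>>) -> BTreeMap<&'static str, u64> {
    let mut total = BTreeMap::new();
    for partial in partials {
        for (key, value) in partial {
            *total.entry(key).or_insert(0) += value;
        }
    }
    total
}

fn main() {
    let worker_a = worker_aggregate(&[("US", 10), ("DE", 5)]);
    let worker_b = worker_aggregate(&[("US", 7), ("FR", 2)]);
    println!("{:?}", merge(vec![worker_a, worker_b]));
    // {"DE": 5, "FR": 2, "US": 17}
}
```

The same shape works for any aggregate with a mergeable partial state (COUNT, MIN, MAX, HyperLogLog sketches, and so on).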
Lambda architecture query
Collocated join indexes
Collocated join indexes
- The fastest distributed join algorithm
- Can also be used by other algorithms, such as shared-nothing Top-K
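The idea behind collocation can be sketched as: partition both tables with the same function of the join key, so matching rows always land on the same worker and every join is purely local. The modulo partitioning function below is an illustrative assumption:

```rust
// Sketch of a collocated join: both tables are routed with the same
// partitioning function of the join key, so each worker can join its
// local shards with no network shuffle at query time.

fn partition_for(key: u64, workers: usize) -> usize {
    (key % workers as u64) as usize // stand-in for the real partitioning
}

fn main() {
    let workers = 2;
    let users = [(1u64, "alice"), (2, "bob"), (3, "carol")];
    let orders = [(1u64, 30.0), (3, 12.5), (3, 7.0)];

    // Route both tables with the same partitioning function.
    let mut shards = vec![(Vec::new(), Vec::new()); workers];
    for u in users {
        shards[partition_for(u.0, workers)].0.push(u);
    }
    for o in orders {
        shards[partition_for(o.0, workers)].1.push(o);
    }

    // Each worker joins only its own shard -- matching keys are
    // guaranteed to be local because both sides used the same routing.
    for (w, (us, os)) in shards.iter().enumerate() {
        for (uid, name) in us {
            for (oid, amount) in os {
                if uid == oid {
                    println!("worker {w}: {name} spent {amount}");
                }
            }
        }
    }
}
```

The same locality guarantee is what enables a shared-nothing Top-K: each worker can compute its local candidates independently before the router merges them.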
Learn more about Cube & Cube Store
@paveltiunov87 · cube.dev
Thanks! 🙌