Row vs Columnar format - slides (supplementing the YouTube video)
jansiekierski2
25 slides
Aug 30, 2025
About This Presentation
Presentation uploaded to supplement the youtube video:
https://youtu.be/a38Bj7BCWFg
Slide Content
Operational vs Analytical plane

Operations:
● Operate on a single object
● Smaller data structures
● Very sensitive to latency
● Many concurrent users

Analytics:
● Long & wide datasets
● Not latency sensitive
● Few users
● Queries often target only a subset of the data
● Batch processing is popular
Disclaimer: I’m not describing user-facing analytics here.
Row vs Column
Storage orientation
● Row orientation
● Column orientation

Row orientation
● Efficient writes on a single object
● Efficient sequential reads
● Not efficient for analytical use cases
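As a minimal sketch (plain Python, with a hypothetical three-record table), here is the same data in both layouts:

```python
# Three records of a hypothetical "orders" table.
rows = [
    {"id": 1, "amount": 10.0, "country": "PL"},
    {"id": 2, "amount": 25.5, "country": "DE"},
    {"id": 3, "amount": 7.25, "country": "PL"},
]

# Row orientation: each record's fields are stored together.
# Appending a new object touches one contiguous region -> efficient writes.
row_layout = [(r["id"], r["amount"], r["country"]) for r in rows]

# Column orientation: all values of one column are stored together.
# Reading just "amount" scans one list and skips everything else.
col_layout = {
    "id": [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "country": [r["country"] for r in rows],
}

# Analytical query: average amount. The columnar layout touches one column only.
avg_amount = sum(col_layout["amount"]) / len(col_layout["amount"])
print(avg_amount)  # 14.25
```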
Avro

[Diagram: Avro file layout — a write schema followed by Object 1, Object 2, … Object N]

[Diagram: Kafka topic — each message carries a schema id plus the object’s data; the Schema Registry stores Schema 1, Schema 2, … Schema N]
Avro
● Binary row-oriented format
● Contains a schema and data
  ○ In Kafka: schema id + data
● Supports schema evolution
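The “schema id + data” framing in Kafka follows the Confluent Schema Registry wire format: one magic byte (0), a 4-byte big-endian schema id, then the Avro-encoded payload. A sketch with a stand-in payload, since producing real Avro bytes needs a library:

```python
import struct

# Confluent Schema Registry wire format for a Kafka message value:
# 1 magic byte (0) + 4-byte big-endian schema id + Avro-encoded data.
# The payload below is a stand-in, not real Avro binary encoding.

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    return struct.pack(">bI", 0, schema_id) + avro_payload

def unframe(message: bytes) -> tuple[int, bytes]:
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == 0, "not a Schema Registry framed message"
    return schema_id, message[5:]

msg = frame(42, b"\x02\x06foo")  # hypothetical payload bytes
sid, payload = unframe(msg)
print(sid, payload)  # 42 b'\x02\x06foo'
```

The schema itself lives in the registry; the message only carries the id, which keeps every record small while still allowing consumers to resolve the right schema.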
Column orientation
● Easy to skip columns you don’t need
● Better analytical query performance when summarizing (aggregating) columns
● More efficient compression
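The compression point can be illustrated with stdlib tools (json + zlib on a hypothetical low-cardinality column — a rough stand-in for real columnar encodings, which do much better still):

```python
import json
import zlib

# Hypothetical dataset: an id column plus a low-cardinality "country" column.
rows = [{"id": i, "country": "PL" if i % 2 else "DE"} for i in range(1000)]

# Row-oriented bytes: values of different columns interleaved.
row_bytes = json.dumps(rows).encode()

# Column-oriented bytes: each column's values stored contiguously,
# so repeated values sit next to each other and compress well.
cols = {
    "id": [r["id"] for r in rows],
    "country": [r["country"] for r in rows],
}
col_bytes = json.dumps(cols).encode()

# The columnar serialization typically compresses to fewer bytes.
print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))
```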
Parquet

Parquet file structure (high level):
● Row group 1
  ○ Column A chunk 1, Column B chunk 1, Column C chunk 1, … Column N chunk 1
● Row group 2
● …
● Row group N
● Footer (file, row group & column metadata)
Parquet
● Binary columnar format
● Widely adopted in Data Lakes and Lakehouses
● Highly performant for analytical use cases
● Very efficient compression
● Performance degrades with too many small files
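The footer metadata is what enables much of that analytical performance: readers can skip whole row groups using per-column min/max statistics. A toy sketch of the idea (hypothetical stats, not real Parquet footer parsing):

```python
# Each Parquet row group stores min/max statistics per column in the footer,
# so a query like "amount > 100" can skip row groups whose range can't match,
# without reading their data pages at all. Hypothetical stats below.

row_groups = [
    {"name": "row group 1", "amount_min": 1,   "amount_max": 50},
    {"name": "row group 2", "amount_min": 40,  "amount_max": 200},
    {"name": "row group 3", "amount_min": 300, "amount_max": 900},
]

def groups_to_scan(threshold):
    """Keep only row groups that may contain rows with amount > threshold."""
    return [g["name"] for g in row_groups if g["amount_max"] > threshold]

print(groups_to_scan(100))  # ['row group 2', 'row group 3']
```

This also hints at the small-files problem: with many tiny files, the reader spends its time opening files and parsing footers instead of skipping or scanning large contiguous row groups.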
Is Avro a valid choice for long-term storage?
Compression efficiency
● Benchmark with two datasets:
  ○ Wide: 194 GB of data, 103 columns
  ○ Narrow: 82.8 million rows, 3 columns
● For the narrow dataset, the formats showed comparable space efficiency
● Wide dataset space efficiency:
  ○ Parquet compressed to 4.7 GB
  ○ Avro to 16.9 GB
https://blog.cloudera.com/benchmarking-apache-parquet-the-allstate-experience/
Long-term storage?
● Data size is a primary driver of your costs
● In Avro, compression efficiency is highly sensitive to batch size
● For use cases where storage cost is not dominant, it might make sense to keep data in Avro longer for convenience
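The batch-size sensitivity can be sketched in plain Python, using zlib as a stand-in for the per-block compression Avro container files apply (hypothetical records):

```python
import json
import zlib

# Many small batches compressed independently vs one large batch of
# the same records: the compressor can only exploit redundancy within
# a single batch, so tiny batches waste most of its potential.
records = [{"id": i, "status": "OK"} for i in range(1000)]
payloads = [json.dumps(r).encode() for r in records]

# Many tiny batches: each compressed alone, little shared context.
small_batches_total = sum(len(zlib.compress(p)) for p in payloads)

# One big batch: cross-record redundancy is fully exploited.
one_big_batch = len(zlib.compress(b"".join(payloads)))

# The single large batch compresses to far fewer bytes in total.
print(small_batches_total, one_big_batch)
```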
[Diagram: Operational plane — a Producer Application publishes Avro to the Kafka Cluster, and a Consumer Application reads from it (Operations). Analytical plane — Kafka Connect sinks the Avro data to S3; ETL converts it to Parquet in the Data Lake, serving Analytics and Data Pipelines.]
Modern table formats
● Iceberg
● Delta Lake
● Key features: time travel, ACID transactions
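A toy sketch of the snapshot mechanism behind time travel (simplified far beyond what Iceberg or Delta Lake actually do; all structures hypothetical): the table is a log of snapshots, each listing the data files valid at that point, so reading "as of" an old snapshot just resolves the file list from that log entry.

```python
# Hypothetical snapshot log: each commit produces a new snapshot that
# records which data files make up the table at that point in time.
snapshots = [
    {"id": 1, "files": ["a.parquet"]},
    {"id": 2, "files": ["a.parquet", "b.parquet"]},
    {"id": 3, "files": ["b.parquet", "c.parquet"]},  # a.parquet removed
]

def files_as_of(snapshot_id):
    """Resolve the data files visible at a given snapshot (time travel)."""
    for s in snapshots:
        if s["id"] == snapshot_id:
            return s["files"]
    raise KeyError(snapshot_id)

print(files_as_of(2))  # ['a.parquet', 'b.parquet']
```

Because old snapshots keep referencing old files until they are expired, queries against a past snapshot see a consistent table state, which is also what makes atomic (ACID) commits possible: a commit is just atomically appending one new snapshot.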