Optimizing Tiered Storage for Low-Latency Real-Time Analytics by Neha Pawar
ScyllaDB
Oct 14, 2025
About This Presentation
Real-time OLAP databases usually trade performance for cost when moving from local storage to cloud object storage. This talk shows how we extended Apache Pinot to use cloud storage while still achieving sub-second P99 latencies. We’ll cover the abstraction that makes Pinot location-agnostic, strategies like pipelining, prefetching, and selective block fetches, and how to balance local and cloud storage for both cost efficiency and speed.
Slide Content
A ScyllaDB Community
Optimizing Tiered Storage for Low-latency Real-time Analytics
Neha Pawar, Founding Engineer, StarTree; Apache Pinot Committer & PMC
For Apache Pinot, in StarTree Cloud
Apache Pinot’s Tiered Storage Journey
■ What is Apache Pinot
■ How we built tiered storage
■ How we optimized it for low-latency querying
Tiered Storage for Apache Pinot in StarTree Cloud
[Diagram: Servers 1 and 2 with local Disk/SSD (fully tightly-coupled); Servers 3 and 4 backed by Cloud Object Storage (hybrid)]
Disk/SSD vs Cloud Object Storage:
● Access speed: microseconds to milliseconds vs 100s of milliseconds
● Access method: POSIX APIs vs network call
● Access availability: single instance vs shared across instances
● Cost: $ vs ~⅕ $
Remote access abstractions
The Pinot query engine was enhanced to be agnostic to segment location.
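A minimal sketch of what such a location-agnostic read abstraction could look like; the interface and class names below are illustrative, not Pinot's actual API:

// Hypothetical sketch: a location-agnostic read abstraction.
// Pinot's real implementation differs; all names here are illustrative.
import java.io.IOException;
import java.nio.ByteBuffer;

interface SegmentBuffer {
    // Read `length` bytes at `offset`, regardless of where the segment lives.
    ByteBuffer read(long offset, int length) throws IOException;
}

// Hypothetical client wrapper over an object store (e.g. S3).
interface ObjectStoreClient {
    ByteBuffer rangeGet(String key, long offset, int length) throws IOException;
}

// Local segments are served straight from a memory-mapped file.
class LocalSegmentBuffer implements SegmentBuffer {
    private final ByteBuffer mmap;
    LocalSegmentBuffer(ByteBuffer mmap) { this.mmap = mmap; }
    public ByteBuffer read(long offset, int length) {
        ByteBuffer slice = mmap.duplicate();
        slice.position((int) offset).limit((int) offset + length);
        return slice.slice();
    }
}

// Remote segments translate the same call into a ranged GET
// against cloud object storage (see the Range GET sketch below).
class RemoteSegmentBuffer implements SegmentBuffer {
    private final ObjectStoreClient client;
    private final String objectKey;
    RemoteSegmentBuffer(ObjectStoreClient client, String objectKey) {
        this.client = client;
        this.objectKey = objectKey;
    }
    public ByteBuffer read(long offset, int length) throws IOException {
        return client.rangeGet(objectKey, offset, length);
    }
}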
Two key questions:
● What data to read?
● When to read?
Why lazy-loading won't work
[Diagram: Servers 1 and 2 lazily downloading whole Pinot segments from Cloud Object Storage; 1st query slow, 2nd query fast]
Q: What to read? A: The entire segment.
Q: When to read? A: During query execution.
● OLAP workloads are non-predictable
● Instance storage is limited
● Wasteful data fetches
● A strict no-go for OLAP
Pinot segment format
Columns: browser, region, country, impressions, cost, timestamp
Per-column index files (forward index, inverted index, dictionary):
browser.fwd_idx, browser.inv_idx, browser.dict
region.fwd_idx, region.inv_idx, region.dict
country…
impressions.fwd_idx, impressions.dict
cost…
timestamp…
Example query:
select sum(impressions) where region = 'rivendell'
All of the per-column index files above are packed into a single segment file, columns.psf.
What to read - Selective columnar fetch
[Diagram: Server 1 issues a Range GET to Cloud Object Storage]
Only fetch region.dict, region.inv_idx, impressions.fwd_idx, impressions.dict.
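For illustration, a byte-range read against S3 can look like this with the AWS SDK for Java v2 (bucket, key, and offsets are placeholders; Pinot's actual fetch layer is more involved):

// Minimal sketch of a byte-range fetch from S3 (AWS SDK for Java v2).
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class RangeGetExample {
    public static byte[] fetchRange(S3Client s3, String bucket, String key,
                                    long offset, long length) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                // HTTP Range header: inclusive byte range inside columns.psf
                .range("bytes=" + offset + "-" + (offset + length - 1))
                .build();
        ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(request);
        return bytes.asByteArray();
    }

    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // e.g. fetch only region.inv_idx's byte range inside the segment file
            byte[] invIdx = fetchRange(s3, "my-pinot-tier",
                                       "segment_0/columns.psf", 4096, 65536);
            System.out.println("fetched " + invIdx.length + " bytes");
        }
    }
}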
When to read - Fetch during segment execution?
Planning phase: make the segment execution plan. The Pinot server then processes segments in parallel, and each segment execution does its own columnar fetch from S3 (fetch → execute → release).
Example: 40 segments at a parallelism of 8 means ceil(40 / 8) = 5 batches. Each S3 access takes ~200ms, and every batch pays that fetch before it can execute:
Fetch for batch 1: 200ms … Fetch for batch 5: 200ms
Total time = executionTimeMs + 1000ms
The CPU sits idle while fetching. Can we decouple fetch and execution?
Pipelining fetch and execution
Planning phase: make the segment execution plan and prefetch all segment columns. Each segment execution then waits for its columns to be available (acquire → execute → release) instead of fetching them itself.
Before prefetch: 5 serialized batch fetches of 200ms each → total time = executionTimeMs + 1000ms
After prefetch: ALL batches are prefetched in parallel during planning, in one ~200ms round trip → total time = executionTimeMs + 200ms
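A simplified model of this pipelining, not Pinot's actual scheduler: kick off every column fetch during planning, then have each segment execution await only its own future.

// Sketch of pipelining fetch and execution; names and pools are illustrative.
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class PrefetchPipeline {
    static byte[] fetchColumns(String segment) {
        return new byte[0];   // stand-in for the ranged S3 fetch (~200ms)
    }

    static long execute(String segment, byte[] columns) {
        return 0L;            // stand-in for actual segment execution
    }

    public static long query(List<String> segments, int parallelism)
            throws Exception {
        ExecutorService ioPool = Executors.newFixedThreadPool(segments.size());
        ExecutorService cpuPool = Executors.newFixedThreadPool(parallelism);
        try {
            // Planning phase: start ALL fetches at once (one ~200ms wave).
            Map<String, CompletableFuture<byte[]>> prefetched = segments.stream()
                    .collect(Collectors.toMap(
                            s -> s,
                            s -> CompletableFuture.supplyAsync(
                                    () -> fetchColumns(s), ioPool)));

            // Execution phase: each task acquires its prefetched columns,
            // executes, and releases them when done.
            List<Future<Long>> results = segments.stream()
                    .map(s -> cpuPool.submit(
                            () -> execute(s, prefetched.get(s).join())))
                    .collect(Collectors.toList());

            long total = 0;
            for (Future<Long> r : results) total += r.get();
            return total;
        } finally {
            ioPool.shutdown();
            cpuPool.shutdown();
        }
    }
}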
What makes Pinot fast - Segment pruning
[Diagram: a query arrives at the Pinot Broker; segments 1-12 are spread across Servers 1, 2, and 3]
Total segments to process: 1 2 3 4 5 6 7 8 9 10 11 12
After broker-level pruning: 5 6 7 8 9 10 11 12
Server-level pruning then narrows the set further on each server.
What makes Pinot fast?
Segment pruning techniques, and how each is accessed:
● Min/max value based pruning: computed by default & cached locally
● Partition based pruning: computed for partitioned data & cached locally
● Bloom filter based pruning: stored inside each segment (which is now on S3)
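A small sketch of min/max based pruning for a filter like "WHERE day BETWEEN lo AND hi"; the metadata shape is illustrative, not Pinot's actual metadata format:

// Keep only segments whose [min, max] range overlaps the filter range;
// everything else is skipped without touching segment data.
import java.util.List;
import java.util.stream.Collectors;

public class MinMaxPruner {
    record SegmentMetadata(String name, long minDay, long maxDay) {}

    public static List<SegmentMetadata> prune(List<SegmentMetadata> segments,
                                              long lo, long hi) {
        return segments.stream()
                .filter(s -> s.maxDay() >= lo && s.minDay() <= hi)
                .collect(Collectors.toList());
    }
}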
What makes Pinot fast - Indexes
Filter optimizations, aggregation optimizations:
● Inverted index
● Sorted index
● Range index
● JSON index
● Text index
● Geo index
Filtering optimizations - Pin any column index locally
[Diagram: Servers 1-4, each pinning selected index files from its Pinot segments on local storage]
preload.index.keys:
account_id.bloom_filter,
transaction_id.bloom_filter
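One way to picture what pinning buys at read time, reusing the hypothetical SegmentBuffer abstraction from earlier (all names here are illustrative, not Pinot's actual code):

// Sketch: route index reads to local pinned copies when available,
// falling back to ranged reads from object storage.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

public class PinnedIndexReader {
    private final Set<String> pinnedKeys;   // from preload.index.keys
    private final Path localDir;            // where pinned index files live
    private final SegmentBuffer remote;     // remote abstraction from earlier

    public PinnedIndexReader(Set<String> pinnedKeys, Path localDir,
                             SegmentBuffer remote) {
        this.pinnedKeys = pinnedKeys;
        this.localDir = localDir;
        this.remote = remote;
    }

    public ByteBuffer readIndex(String indexKey, long offset, int length)
            throws IOException {
        if (pinnedKeys.contains(indexKey)) {
            // e.g. account_id.bloom_filter served straight from local disk
            byte[] all = Files.readAllBytes(localDir.resolve(indexKey));
            return ByteBuffer.wrap(all, (int) offset, length);
        }
        // everything else stays a ranged read against the segment on S3
        return remote.read(offset, length);
    }
}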
Benchmark
Tested with about 300 segments (200GB in total) on one r5.2xlarge Pinot server.

SELECT COUNT(*)
FROM GithubEventsTier
WHERE DAY >= 20200701 AND DAY <= 20200714
Decoupled system: 5340ms | Pinot tiered storage: 63ms | Pinot tightly coupled: 5ms

SELECT MAX(pull_request_additions)
FROM GithubEventsTier
WHERE DAY >= 20201101 AND DAY <= 20201114 AND type = 'PullRequestEvent'
Decoupled system: 1580ms | Pinot tiered storage: 350ms | Pinot tightly coupled: 35ms

SELECT MAX(pull_request_commits), COUNT(*), repo_id
FROM GithubEventsTier
WHERE type = 'PullRequestEvent' AND DAY = 20201114
GROUP BY repo_id ORDER BY COUNT(*) DESC
LIMIT 1000
Decoupled system: 1400ms | Pinot tiered storage: 278ms | Pinot tightly coupled: 49ms

SELECT SUM(pull_request_commits)
FROM GithubEventsTier
WHERE JSON_MATCH(actor_json, '"actor_id"=''39814207''') AND DAY = 20200701
Decoupled system: 8560ms | Pinot tiered storage: 397ms | Pinot tightly coupled: 8ms
Q: What to read? A: Selective columnar fetch + pruning + index pinning
Q: When to read? A: Prefetch during planning
Stress test and lessons learnt
■ Network bandwidth can be saturated easily
● Fetching the full columns from requested segments puts pressure on prefetch resources
● Once saturated (~1.25 GB/s), latency grows linearly with bytes fetched
→ Read less data
■ When the network is not saturated
● More I/O parallelism is needed to lower query latency
→ Increase parallelism
Columnar fetch vs block fetch (block reads), by filter selectivity:

Selectivity | Columnar fetch: latency, data read | Block fetch: latency, data read
High        | 330 ms, 131 MB                     | 520 ms, 12 MB
Medium      | 3500 ms, 3.3 GB                    | 1520 ms, 320 MB
Low         | 50000 ms, 54 GB                    | 15000 ms, 6 GB

Block reads sharply reduce the bytes fetched. Identify and prefetch these chunks during the planning phase, using sparse indexes, OR prefetch right after the filter phase.
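A rough sketch of the block-read idea under assumed fixed-size chunks (Pinot's actual chunk layout and sparse-index format differ; all names here are illustrative):

// After the filter phase, we know which docIds survive. Map them to the
// forward-index chunks containing them and fetch only those byte ranges,
// instead of the whole column.
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class BlockReader {
    static final int DOCS_PER_CHUNK = 4096;   // assumed fixed-size chunks

    record ChunkRange(long offset, long length) {}

    // chunkIndex[i] = byte range of chunk i inside columns.psf
    // (in practice derived from a sparse index / chunk offset table)
    public static List<ChunkRange> rangesToFetch(int[] matchingDocIds,
                                                 ChunkRange[] chunkIndex) {
        SortedSet<Integer> neededChunks = new TreeSet<>();
        for (int docId : matchingDocIds) {
            neededChunks.add(docId / DOCS_PER_CHUNK);
        }
        List<ChunkRange> ranges = new ArrayList<>();
        for (int chunk : neededChunks) {
            ranges.add(chunkIndex[chunk]);
        }
        // each range becomes one ranged GET; they can be issued in parallel
        return ranges;
    }
}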
Benchmark: column read vs block read

SELECT payload_pull_request // <- large raw forward index
FROM github_events_bc_big
WHERE repo_name = '...' // <- high selectivity filter
Column read: 1593 ms | Block read: 272 ms

SELECT id // <- dictionary encoded column
FROM github_events_bc_big
WHERE repo_name = '...' // <- high selectivity filter
Column read: 158 ms | Block read: 39 ms
Q: What to read? A: Blocks
Q: When to read? A: Prefetch with post-filter parallelism
Apache Pinot's Tiered Storage Journey
■ What is Apache Pinot
■ How we built tiered storage
■ How we optimized it for low-latency querying

What makes Pinot fast:
● No lazy loading, no cache-dependent solutions
● What to read: selective columnar fetch, block reads
● When to read: pipelined fetch & execution, prefetch, pinning, post-filter parallelism
User-facing real-time analytics - Uber
Metrics like missed orders, inaccurate orders, downtime, top selling items, and menu item feedback.
Apache Pinot Community
Cost, Performance, Flexibility
Tiered Storage for Apache Pinot in StarTree Cloud: the flexibility of using a single system for historical + real-time data, without worrying about the cost spiralling out of control.
All data is not equal
Real-time data is latency sensitive; historical data is cost sensitive.
● Real-time, user-facing analytics (missed orders, inaccurate orders, downtime, top selling items, menu item feedback): latency in milliseconds, concurrency of millions of users
● Internal dashboards, reporting, ad-hoc: latency in sub-seconds, concurrency of 100s of users
What analytics infra should I choose?
● Performance: service the different workloads while maintaining the different query and freshness SLAs
● Cost: the cost to serve justifies the business value extracted
● Flexibility: easy to operate, easy to configure, fulfils lots of requirements
Real-time analytics (OLAP) databases: tightly coupled storage & compute
[Diagram: Servers 1-5, each with its own local Disk/SSD]
● Paying for compute which could remain unutilized
● Disk/SSD storage is expensive (compared to cloud object storage)

Num compute units | Storage (TB) | Monthly compute cost | Monthly storage cost | Total monthly cost
1                 | 2            | $200                 | $200                 | $400
5                 | 10           | $1000                | $1000                | $2000
10                | 20           | $2000                | $2000                | $4000
100               | 200          | $20000               | $20000               | $40000
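The unit economics implied by the table (inferred from its rows, not stated explicitly in the talk): $200 per compute unit per month, 2 TB of storage per unit at $100 per TB-month, so TotalMonthlyCost(n) = 200·n + 100·(2·n) = 400·n dollars, growing linearly with compute units.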
Tightly coupled systems: high cost to serve
Retention in months, query frequency, latency (ms); cost to serve: $$$$$$
Data warehouses, data lakes, lakehouses
[Diagram: Servers 1 and 2 over Cloud Object Storage]
Decoupled systems: trade cost for latency
Retention in months, query frequency, latency (seconds); cost to serve: $
Use both systems together
Real-time data goes to realtime analytics databases; historical data goes to data warehouses / data lakes.
Data retention, query frequency, latency (seconds); cost to serve: $
Takeaways
● Tightly-coupled: latency in milliseconds; cost to serve $$$$$$
● Decoupled: latency in seconds; cost to serve $
● Tiered Storage in Apache Pinot:
  ● Flexibility of using a single system for historical + real-time data
  ● Without worrying about the cost spiralling out of control
  ● Better performance than traditionally decoupled systems, pushing the boundaries to get closer and closer to tightly-coupled latencies