QuestDB: The building blocks of a fast open-source time-series database

supercoco9 · 40 slides · Dec 14, 2023

About This Presentation

(talk delivered at OSA CON 23)

Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever growing ...


Slide Content

December 12-14, 2023
The Building Blocks of a
Time-Series Database
Javier Ramirez
Database advocate
@supercoco9

Timestamp problems are hard

https://twitter.com/rshidashrf/status/975566803052134400

https://stackoverflow.com/questions/6841333/why-is-subtracting-these-two-epoch-milli-times-in-year-1927-giving-a-strange-r

1553983200
* This is a date. Which one?

When is this?

1/4/19
* Is this January 4th, 2019? April 1st, 2019? Or even April 19th, 2001?
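The ambiguity is easy to demonstrate: the same epoch value lands on different calendar dates depending on the zone offset. A quick Python check (the UTC+2 offset is just an illustrative choice):

```python
from datetime import datetime, timezone, timedelta

ts = 1553983200  # seconds since the Unix epoch -- but which calendar date?

utc = datetime.fromtimestamp(ts, tz=timezone.utc)
plus2 = datetime.fromtimestamp(ts, tz=timezone(timedelta(hours=2)))

print(utc.date())    # 2019-03-30
print(plus2.date())  # 2019-03-31
```

Same instant, two different dates, before time zone rules or DST even enter the picture.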

Working with timestamped data in a
database is tricky*
* especially when analysing data that changes over time or arrives at a high rate

If you can use only one
database for everything, go
with PostgreSQL*
* Or any other major and well supported RDBMS

Some things RDBMSs are not designed for
●Writing data faster than it is read (several million inserts per day and beyond)
●Aggregations scoped to different time units (per year/minute/microsecond)
●Identifying gaps or missing data for a given interval
●Joining tables by approximate timestamp
●Sparse data (tables with hundreds or thousands of columns)
●Aggregates over billions of records

●a factory floor with 500 machines, or
●a fleet with 500 vehicles, or
●50 trains, with 10 cars each, or
●500 users with a mobile phone

Sending data every second
How I made my first billion

86,400
* Seconds in one day

604,800
* Seconds in one week

2,628,288
* Seconds in one month. Well, in the average month of 30.42 days anyway

43,200,000 rows a day…….
302,400,000 rows a week….
1,314,144,000 rows a month

How I made my first billion
* See? On streaming data, it is kind of easy to get your first billion data points
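The slide's arithmetic in a few lines of Python, using the hypothetical 500 devices from above:

```python
DEVICES = 500                       # machines, vehicles, trains, phones...
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
AVG_DAYS_PER_MONTH = 30.42          # the "average month" from the slide

rows_per_day = DEVICES * SECONDS_PER_DAY                  # 43,200,000
rows_per_week = rows_per_day * 7                          # 302,400,000
rows_per_month = round(rows_per_day * AVG_DAYS_PER_MONTH) # 1,314,144,000

print(rows_per_day, rows_per_week, rows_per_month)
```

One row per device per second is all it takes to cross a billion rows in a month.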

Not all data
problems are
the same

Time-series database basics

●Optimised for fast append-only ingestion
●Data lifecycle policies
●Analytics over chunks of time
●Time-based aggregations
●Often power real-time dashboards

QuestDB would like to be known for:
●Performance
○Also with smaller machines

●Developer Experience

●Proudly Open Source (Apache 2.0)

Fast streaming
ingestion
* You can try ingesting streaming data using https://github.com/javier/questdb-quickstart

QuestDB ingestion and storage layer
●Data always stored by incremental timestamp.

●Data partitioned by time units and stored in columnar format.

●No indexes needed. Data is immediately available after writing.

●Predictable ingestion rate, even under demanding workloads (millions/second).

●Built-in event deduplication.

●Optimized data types (symbol, geohash, IPv4, UUID).

●Row updates and upserts supported.

Lifecycle policies
ALTER TABLE my_table DROP PARTITION LIST '2021-01-01', '2021-01-02';
-- Delete partitions before 2021-01-03
ALTER TABLE my_table DROP PARTITION WHERE timestamp < to_timestamp('2021-01-03', 'yyyy-MM-dd');
ALTER TABLE x DETACH PARTITION LIST '2019-02-01', '2019-02-02';


-- It is also possible to use WHERE clause to define the partition list
ALTER TABLE sensors DETACH PARTITION WHERE timestamp < '2019-02-03T00';
CREATE TABLE my_table (i symbol, ts timestamp) IN VOLUME SECONDARY_VOLUME;

Connectivity, protocols, and interfaces
●REST API and web console: Query execution, CSV imports/exports. Basic charts.

●PGWire: perfect for querying, DDL, and DML. Ingestion supported, up to moderate
throughput. Compatible with any low-level PostgreSQL client or library.

●InfluxDB Line Protocol (ILP): socket-based, ingestion only, very high throughput. Official clients
available for C/C++, Java, Python, Rust, Go, Node.js, and .NET.

●Health/Metrics: HTTP endpoint with Prometheus format

●Integrations with: Apache Kafka, Apache Flink, Apache Spark, Python Pandas, Grafana,
Superset, Telegraf, Redpanda, qStudio, SQLAlchemy, Cube…
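The ILP wire format mentioned above is plain text. A toy formatter (escaping, string quoting, and buffering omitted; in practice you would use one of the official clients) shows its shape:

```python
def ilp_line(table, symbols, columns, ts_nanos):
    """Build one InfluxDB Line Protocol line: table,sym=val col=val ts\\n"""
    syms = ",".join(f"{k}={v}" for k, v in symbols.items())
    cols = ",".join(f"{k}={v}" for k, v in columns.items())
    return f"{table},{syms} {cols} {ts_nanos}\n"

# Hypothetical trade event; timestamp is in nanoseconds since the epoch.
line = ilp_line("trades", {"symbol": "BTC-USD"}, {"price": 39269.98},
                1638202821000000000)
print(line)  # trades,symbol=BTC-USD price=39269.98 1638202821000000000
```

Each line is one row; the server appends it to the table's partitions as it arrives.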

The query engine

QuestDB Query engine internals
●Our Java codebase has zero dependencies. No garbage collection on
the hot path. As close to the hardware as possible.

●We research the latest trends. Our code takes advantage of the
state-of-the-art in CPU, storage design, and data structures.

●We implement our own just-in-time (JIT) compiler to make query execution
as parallel and fast as possible.

●We spend weeks of development to save microseconds or
nanoseconds in many operations.

The query language: SQL with
time-series extensions

LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple
time series are stored in the same table.


SELECT * FROM trades
LATEST ON timestamp PARTITION BY symbol;
Try it live on
https://demo.questdb.io

LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple
time series are stored in the same table.


SELECT * FROM trades
WHERE symbol in ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol, side;
Try it live on
https://demo.questdb.io

SAMPLE BY
Aggregates data in homogeneous time chunks


SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 15m ALIGN TO CALENDAR;
SELECT timestamp, min(tempF),
max(tempF), avg(tempF)
FROM weather SAMPLE BY 1M;
Try it live on
https://demo.questdb.io
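What SAMPLE BY computes can be mimicked in a few lines of plain Python; the trades and bucket numbers below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical trades: (epoch_seconds, price, amount)
trades = [(0, 100.0, 2.0), (10, 110.0, 1.0), (900, 120.0, 4.0)]

BUCKET = 900  # 15 minutes; truncating to bucket starts mirrors ALIGN TO CALENDAR
acc = defaultdict(lambda: [0.0, 0.0])  # bucket start -> [sum(price*amount), sum(amount)]
for ts, price, amount in trades:
    b = acc[ts // BUCKET * BUCKET]
    b[0] += price * amount
    b[1] += amount

vwap = {start: pa / a for start, (pa, a) in acc.items()}
print({k: round(v, 2) for k, v in vwap.items()})  # {0: 103.33, 900: 120.0}
```

The database does this per time chunk, in parallel, over the columnar partitions.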

How do you ask your database to
return which data is not stored?

I am sending data every second or
so. Tell me which devices didn’t
send any data with more than 1.5
seconds gap
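In plain Python, over a single device's timestamps, this question amounts to a pairwise scan for oversized gaps (toy data, in seconds):

```python
# Hypothetical arrival times for one device, in seconds.
readings = [0.0, 1.0, 2.1, 5.0, 6.0, 9.2]

# Flag consecutive readings more than 1.5 s apart.
gaps = [(a, b) for a, b in zip(readings, readings[1:]) if b - a > 1.5]
print(gaps)  # [(2.1, 5.0), (6.0, 9.2)]
```

Doing this per device, at scale, inside the database is what the SAMPLE BY extension below is for.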

SAMPLE BY … FILL
Can fill missing time chunks using different strategies (NULL, a constant, LINEAR, or PREV for the previous value)


SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
Try it live on
https://demo.questdb.io

WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';






Try it live on
https://demo.questdb.io

WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';



SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10;








Try it live on
https://demo.questdb.io

WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';



SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10;


SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;1d;7';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;-1d;7';







Try it live on
https://demo.questdb.io
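The semantics of the partial-timestamp shorthand can be sketched with a hypothetical helper that expands a 'YYYY-MM' prefix into a half-open interval; the real engine of course compares full timestamps against partition data, not (year, month) tuples:

```python
def month_interval(prefix: str):
    """Expand a 'YYYY-MM' prefix into a half-open [(y, m), next month) interval.

    Toy illustration of what `WHERE ts IN '2018-06'` conceptually matches.
    """
    y, m = map(int, prefix.split("-"))
    end = (y + 1, 1) if m == 12 else (y, m + 1)
    return (y, m), end

print(month_interval("2018-06"))  # ((2018, 6), (2018, 7))
print(month_interval("2018-12"))  # ((2018, 12), (2019, 1))
```

Because data is partitioned by time, such an interval maps directly onto a contiguous range of partitions.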

What if I have two tables, where
data is (obviously) not sent at the
same exact timestamps and I want
to join by closest matching
timestamp?

ASOF JOIN (LT JOIN and SPLICE JOIN variations)
ASOF JOIN joins two different time-series. For each row in the first time-series, ASOF JOIN takes from
the second time-series the row whose timestamp meets both of the following criteria:
●The timestamp is the closest to the first timestamp.
●The timestamp is prior to or equal to the first timestamp.



WITH trips2016 AS (
SELECT * from trips WHERE pickup_datetime in '2016'
)
SELECT pickup_datetime, timestamp, fare_amount, tempF, windDir
FROM trips2016
ASOF JOIN weather;
Try it live on
https://demo.questdb.io
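Those two criteria can be sketched in Python as a binary search over sorted timestamps (toy integers, not the demo data):

```python
import bisect

def asof_pick(left_ts, right_ts):
    """For each left timestamp, pick the closest right timestamp <= it (else None).

    right_ts must be sorted ascending, which time-series storage guarantees.
    """
    out = []
    for t in left_ts:
        i = bisect.bisect_right(right_ts, t) - 1
        out.append(right_ts[i] if i >= 0 else None)
    return out

print(asof_pick([5, 12, 20], [1, 10, 15, 30]))  # [1, 10, 15]
```

With both tables already ordered by timestamp, the engine can do this with a single merge-style pass instead of per-row searches.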

Some things we are trying out next for performance
●Compression, and exploring data formats like Arrow/Parquet

●Own ingestion protocol

●Second-level partitioning

●Improved vectorization of some operations (group by multiple columns or by expressions)

●Specific join optimizations (index nested-loop joins, for example)

QuestDB OSS
Open Source. Self-managed. Suitable for
production workloads.
https://github.com/questdb/questdb


QuestDB Enterprise
Licensed. Self-managed. Enterprise features like
RBAC, compression, replication, TLS on all
protocols, cold storage, K8s operator…
https://questdb.io/enterprise/

QuestDB Cloud
Fully managed, pay-per-usage environment,
with enterprise-grade features.
https://questdb.io/cloud/

OSA CON | December 12-14 2023
Q&A
●github.com/questdb/questdb
●https://questdb.io
●https://demo.questdb.io
●https://github.com/javier/questdb-quickstart
●https://slack.questdb.io/
Javier Ramirez
@supercoco9
We love contributions
and GitHub ⭐ stars