Timeplus: One single binary to tackle streaming and historical analytics
chloewilliams62
149 views
21 slides
Oct 11, 2024
Slide 1 of 21
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
About This Presentation
Timeplus talk during SFTechWeek Event on Oct 8.
Size: 1.17 MB
Language: en
Added: Oct 11, 2024
Slides: 21 pages
Slide Content
SF TechWeek : Data & AI Edition
Ken Chen, Co-founder & Chief Architect
Oct 8, 2024
Proton - one single binary to tackle
streaming and historical analytics
Agenda
•Two Quick Demos
•What kind of problems Timeplus Proton resolves
•Architecture
•Storage
•Query Processing
•Python Extension
•Integrate with ML / AI Use Cases Study
•QA
Demo 2 : Ingest, Materialize, Serving (All in one binary)
Proton Source
Stream
Proton Materialized View
(Window Aggregation)
Proton Target
Stream
Data Gen
What’s Timeplus Proton
•Core problems to solve :
“Query, detect and act on fast-changing stream data in an
incremental way.”
•Key Tech Highlights
•Unifies streaming processing and historical analytics via SQL
•Dual data stores to handle data persistence
•Materialized incremental computation engine
•Everything is in single binary, light-weight and efficient , run from edge
to cloud, from single instance to distributed cluster
CONFIDENTIAL
Proton Architecture
Timeplus Proton
(Edge / Cloud)
Incremental ML / AI
CONFIDENTIAL
Proton Architecture - Zoom In
Unified Incremental SQL Processing Engine (Vectorized / JIT)
NativeLog
(Multi-Raft Replicated)
Query processing super power
(Applied state of the art
vectorized / JIT capabilities)
V8 Engine
(JavaScript)
CPython
Engine
(Cloud) Historical StoreNativeLog - Streaming Store
Optimized for concurrent ingest / latency Optimized for bulk scan / throughput
SQL via TCP, JSON Rest via HTTP
Ingest Query
API Layer
CONFIDENTIAL
Streaming Storage – NativeLog (Zoom In)
9
Replicated
LogManager
MetaStore
(Replicated)
NativeLog
Server
Loglet
Loglet
Segment
sequence
index
atime
index
etime
index
Client
ReplicatedLog
Segment
…
Query - One SQL to Rule All Analytics
Powerful windowing, aggregation and joining etc for streaming and historical query
SUBSCRIBE TO
SELECT device_name, avg(temperature), predict(temperature)
FROM tumble(devices, 5s)
INNER JOIN table(device_products)
ON devices.product_id = device_products.id
GROUP BY device_name, window_end
[SETTINGS seek_to = ‘2021-12-02 10:00:00’]
EMIT [AFTER watermark | LAST 30m]
1. Just one declarative SQL query
2. The most
advanced streaming
windowing & global
functions
5. Intelligent watermark control can
handle late events and time skew issues
properly
7. JOIN between
stream and stream, or
stream and table can
drive more real-time
analytics insights
6. LAST X can help user focusing on what’s
happening in recent time window
9. Super fast push-based stateful query through TCP.
No more “Refresh”!
4. Time can be easily
rewinded to any
historical moment for
reprocessing
8. UDA/F for customer functions, e.g. ML
prediction or anomaly detection, customized
aggregations, Complex Event Processing
3. Connect historical
data via table()
CONFIDENTIAL
Extended SQL / Execution Plans
13
StreamSource HistoricalSource
BuildingHashTable
Joining
Transform
Watermark
Transform
WindowAssigment
Transform
TumbleWindowAggr
Transform
Output Project
WITH joined AS
(
SELECT
*
FROM devices
INNER JOIN table(device_inventory) AS
device_inventory ON id = device_inventory.device_id
)
SELECT
max_k(cpu, device_name), window_start
FROM
tumble(joined, 5s)
GROUP BY window_start
CONFIDENTIAL
Query Processing - Historical Backfill
EntryEntryEntry Block Block Block Block
Historical Data
Reader (Backfill)
Streaming Data
Reader (Live)
Concatenated Stream Reader
Last sequence number
(1)
(2)
(3)
Downstream
Pipeline
NativeLog Historical Data Store
●Streaming Lookup
●Streaming Enrichment
●Traveling across historical
store and streaming events
●Correlated windows analytics:
Compare live window with
historical window
sn
SELECT count(), max(cpu), max(memory)
FROM devices WHERE _tp_time > ‘2022-11-12 00:00:00’ GROUP by device_id;
CONFIDENTIAL
Native Python Integration (Python UDF)
CREATE FUNCTION mask_password(values string) RETURN string
LANGUAGE PYTHON AS
$$
import re
for value in values:
value[i]=re.sub(‘password=(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}’, ‘password=***’, v);
return value;
$$;
CONFIDENTIAL
Native Python Integration (Python UDAF)
CREATE AGGREGATE FUNCTION second_max(value double) RETURN double
LANGUAGE PYTHON AS
$$
def initialize():
pass
def process(values):
pass
def finalize():
pass
}
$$;
CONFIDENTIAL
Case Study : Real-time DDos Detection
https://www.timeplus.com/post/real-time-ddos-detection
CONFIDENTIAL
Case Study : Real-time Machine Learning
https://www.timeplus.com/post/real-time-machine-learning
Recap - Timeplus Proton
•Core problems to solve :
“Query, detect and act on fast-changing stream data in an
incremental way.”
•Key Tech Highlights
•Unifies streaming processing and historical analytics via SQL
•Dual data stores to handle data persistence
•Materialized incremental computation engine
•Everything is in single binary, light-weight and efficient , run from edge
to cloud, from single instance to distributed cluster
Thank you!
CONFIDENTIAL
Timeplus Cluster - Single Binary with Multiple-Roles
(Replicated NativeLog, powered by Multi-Raft)
Data Node
shard-1 shard-3shard-2
Data Node
shard-1 shard-3shard-2
Data Node
shard-1 shard-3shard-2
Data Ingestion
Node
Data Query Node
Data Ingestion
Node
Data Ingestion
Node
Data Query Node
Data Query Node
Query Query QueryIngestion Ingestion Ingestion
Client
replica = 3
Access
Layer
Computing
Layer
Data
Layer
Data
Replication
and parallel
access
High
Availability
Query and
Ingestion and
horizontal
scalability
High
Availability
Access and
Load
Balancing
Metadata
Node
Metadata
Node
Metadata
Node
CONFIDENTIAL
Timeplus Cluster - Query Failover
Data Node
shard-1 shard-3shard-2
Data Node
shard-1 shard-3shard-2
Data Node
shard-1 shard-3shard-2
Data Query Node Data Query Node
Query
Query
Client
replica = 3
checkpoint checkpoint checkpoint
Access
Layer
Computing
Layer
Data
Layer
Data
Replication
and parallel
access
High
Availability
Query and
Ingestion and
horizontal
scalability
High
Availability
Access and
Load
Balancing