Time Series to Vectors: Leveraging InfluxDB and Milvus for Similarity Search

chloewilliams62 189 views 52 slides Oct 17, 2024

About This Presentation

In this webinar, we’ll explore how combining time series data with vector similarity search can improve urban traffic management. Learn how to transform raw sensor data from InfluxDB into meaningful vectors, enabling advanced pattern recognition and anomaly detection using Milvus...


Slide Content

| © Copyright 2023, InfluxData1
Time Series to Vectors:
Leveraging InfluxDB and
Milvus for Similarity Search
Anais Dotis-Georgiou
October 2024

Anais Dotis-Georgiou
Developer Advocate
LinkedIn

Agenda
●Introduction to InfluxDB and Time Series Databases
●TSDB vs Vector Databases: Apples to Oranges
●Projects you can try!
●Demo: Leveraging InfluxDB and Milvus for Similarity Search for Time Series
●Use Cases
●(Time Permitting) Tools for data processing and ML tasks with InfluxDB

Introduction to InfluxDB and
Time Series Databases

A Critical Component of Modern Data Pipelines
Time Series Data

The age of instrumentation
•Instrumentation of the virtual world (e.g. DevOps)
•Sensors in the physical world (e.g. IoT)

Time Series Data Types
•Metrics: Measurements at regular time intervals
•Events: Measurements at irregular time intervals

Time series in every application
Infrastructure & data sources
Consumer & Industrial IoT
•Renewable & alternative energy systems
•Manufacturing & industrial platforms
•Fleet management & telematics
Software Infrastructure
•Developer Tools & APIs
•Kubernetes (K8s)
•DevOps Monitoring
Real-time Applications
•Gaming Applications
•Fintech Applications
•Network Monitoring
TIME SERIES DATA

Rise of time series as a category
RELATIONAL
•Orders
•Customers
•Records
DOCUMENT
•High throughput
•Large document
SEARCH
•Distributed search
•Logs
•Geo
TIME SERIES
•Events, metrics, time stamped
•for IoT, analytics, cloud native
Time series is the fastest growing data category by far
[Chart: Time series vs. All others; source: DB Engines]

Time Series Databases
Time Series Data
•High write throughput
•Efficient Queries Over Time Ranges
•Scalability and Performance

InfluxDB 3.0

Vector Databases

New kid on the block: Vector Databases

TSDB vs Vector Databases

TSDB
Use Cases:
•Monitoring
•IoT
•Predictive Maintenance

Advantages:
•Optimized for Time Series Data
•Time-Based Aggregations
•Fast Inserts and Queries

Vector
Use Cases:
•Similarity Search
•Machine Learning and AI

Advantages:
•Efficiency in High-Dimensional Data
•Similarity Searches
•Support for Complex Data Types

ML with TS DBs:
•Forecasting
•Time Series Classification
•Anomaly Detection
•Regression Analysis

ML with Vector DBs:
•Similarity Search
•Clustering
•Anomaly Detection
•Nearest Neighbor Classification

ML and Data Processing
Projects you can try!

Querying Programmatically via Flight
# Library import
from influxdb_client_3 import InfluxDBClient3

# Initialization
host = "eu-central-1-1.aws.cloud2.influxdata.com"
org = "6a841c0c08328fb1"
token = ""  # your InfluxDB API token
database = "database"

client = InfluxDBClient3(
    token=token,
    host=host,
    org=org,
    database=database)  # the database to query

# Query: "table" is a placeholder for your table (measurement) name
sql = '''SELECT * FROM table'''
df = client.query(query=sql, language='sql', mode='pandas')
print(df)

From Time Series to Vectors
influxdata.com/blog/time-series-influxdb-vector-database/

Vector Database with InfluxDB
[Diagram labels: Sensor Data (Vehicle Count, Average Speed), Video, Vectors, Embeddings, Traffic Anomaly]

Vectorizing Time Series Data for
Similarity Search
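
A minimal sketch of the idea on this slide, assuming fixed-length sliding windows and cosine similarity (the slides specify neither). The names `window_vectors`, `cosine_similarity`, and `top_k` are hypothetical, and the brute-force search only stands in for what a vector database such as Milvus would do with an index at scale.

```python
import math


def window_vectors(values, size, step):
    """Slice a series into fixed-length windows and z-normalize each one.

    Returns a list of (start_index, vector) pairs.
    """
    vectors = []
    for start in range(0, len(values) - size + 1, step):
        window = values[start:start + size]
        mean = sum(window) / size
        std = math.sqrt(sum((v - mean) ** 2 for v in window) / size)
        scale = std if std > 0 else 1.0  # avoid dividing by zero on flat windows
        vectors.append((start, [(v - mean) / scale for v in window]))
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Brute-force search: the k windows most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), start) for start, vec in vectors]
    scored.sort(reverse=True)
    return scored[:k]


# Illustrative data only: two identical ramps followed by a flat stretch.
series = [0, 1, 2, 3, 0, 1, 2, 3, 5, 5, 5, 5]
vectors = window_vectors(series, size=4, step=4)
matches = top_k(vectors[0][1], vectors, k=2)
```

Z-normalizing each window makes the comparison about shape rather than absolute level, which is one common choice for pattern matching on sensor data; other embeddings are equally possible.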

Demo

Saving the Holidays
github.com/InfluxCommunity/quix-saving-holidays

Packing Co - Anomaly Detection
•Packing Co is having recurring issues with one of their packaging machines.
•Unexpectedly, one of the machines will enter a failing state, which requires a manual reset by an engineer.
•The Plant Manager has advised that, when running normally, all machine sensors follow similar output patterns. If a machine is at fault, these fluctuate abnormally.
•How can we use HiveMQ, HuggingFace and InfluxDB to solve this?

Solution Architecture
[Diagram labels: factory, MQTT Client, Data Processing Engine, Query, ML Model, Destination, Destination, machine_data, mlresults, Grafana]

In an ideal world
This could easily be solved with thresholding
Realistically…
What do we do when our result becomes unpredictable by conventional means?

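Artificial Neural Networks - Autoencoder
[Diagram: input (i/p) -> Encoder -> Bottleneck -> Decoder -> output (o/p)]
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

timesteps, input_dim = 30, 1  # placeholder values; the slide does not define them

# Encoder: compress each (timesteps, input_dim) window into a 4-value bottleneck
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(16, activation='relu', return_sequences=True)(inputs)
encoded = LSTM(4, activation='relu', return_sequences=False)(encoded)
# Decoder: repeat the bottleneck for every timestep and rebuild the window
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(4, activation='relu', return_sequences=True)(decoded)
decoded = LSTM(16, activation='relu', return_sequences=True)(decoded)
decoded = TimeDistributed(Dense(input_dim))(decoded)
autoencoder = Model(inputs, decoded)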
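
The slide stops at the model definition. One common way to turn an autoencoder into an anomaly detector is to score each window by its reconstruction error and flag windows whose error is well above what was seen on normal data. The sketch below assumes that approach; the function names and the mean-plus-three-standard-deviations rule are assumptions, not something stated in the deck.

```python
import math


def reconstruction_error(window, reconstruction):
    """Mean squared error between a window and its reconstruction (flat lists)."""
    return sum((x - y) ** 2 for x, y in zip(window, reconstruction)) / len(window)


def fit_threshold(normal_errors, n_sigma=3.0):
    """Threshold = mean + n_sigma * std of errors measured on normal data."""
    mean = sum(normal_errors) / len(normal_errors)
    variance = sum((e - mean) ** 2 for e in normal_errors) / len(normal_errors)
    return mean + n_sigma * math.sqrt(variance)


def is_anomalous(window, reconstruction, threshold):
    """A window is flagged when the model reconstructs it poorly."""
    return reconstruction_error(window, reconstruction) > threshold


# Illustrative numbers only: errors observed while the machine ran normally.
threshold = fit_threshold([0.10, 0.12, 0.08, 0.11, 0.09])
```

In practice the reconstructions would come from the trained model's predictions on each window, and the threshold would be tuned against known failures.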

CSTR
github.com/InfluxCommunity/CSTR_InfluxDB

Use Cases: ML and InfluxDB

Overview of Tasks in InfluxDB 3.0
•InfluxDB 3.0 favors interoperability with other ETL and stream processing tools instead of locking users into InfluxDB-specific task tooling.
•Users have access to a wide variety of streaming and task tools, so they can find the one that works best for them.
•Having more choices requires more decision-making up front.

Tools for Tasks

This lesson covers tools for which we have created demos and POCs. It is not an exhaustive list of all the available ETL tools or solutions.

Mage.ai
●The open source alternative to Apache Airflow.
●An open source ETL tool.
●A UI that simplifies the ETL process.

Advantages to Mage
•Open Source, easy to use.
•Mage features:
  •Orchestration: Schedule and manage data pipelines with observability.
  •Notebook editor: Interactive Python, SQL, & R editor for coding data pipelines.
  •Data integration: Synchronize data from 3rd party sources to your internal destinations.
  •Streaming: Ingest and transform real-time data.
  •dbt: Build, run, and manage your dbt models with Mage.
•Clear documentation on how to deploy on AWS, Azure, DigitalOcean, and GCP with Terraform and Helm Charts.

Resources for Mage and InfluxDB 3.0
•Mage.ai for Tasks with InfluxDB: A blog post highlighting how to set up a simple downsampling task with Mage and InfluxDB 3.0.
•Mage for Anomaly detection with InfluxDB and Half-space Trees: A blog post on performing anomaly detection with Mage and InfluxDB 3.0.
•ETL Made Easy: Best Practices for Using InfluxDB and Mage.ai: An on-demand webinar on best practices for using Mage as an ETL tool with InfluxDB. Includes a demo on anomaly detection with Mage and InfluxDB 3.0.
•Mage Documentation
•Mage_Demo: A containerized repo highlighting the anomaly detection use case.

Mage & InfluxDB - Anomaly Detection

Model: Half-Space Trees

Try it yourself
https://github.com/InfluxCommunity/Mage_Demo

AWS Fargate
Serverless compute engine for containers that works with both
Amazon Elastic Container Service (ECS) and Amazon Elastic
Kubernetes Service (EKS).

Advantages to Fargate
•Serverless Simplicity: Fargate abstracts away the underlying infrastructure, allowing developers to deploy containers without worrying about provisioning, scaling, or managing EC2 instances.
•Cost Efficiency: Fargate charges users based on the resources consumed by the containers, providing cost savings by eliminating the need to maintain idle EC2 instances.

Resources for Fargate and InfluxDB
•ricks-downsampler: a repo that contains a containerized downsampler complete with scheduling options and some monitoring.
•Saving AWS Costs by using Fargate Scheduling: a blog post that compares the costs associated with:
1. Using Fargate to continuously run a container that had built-in scheduling and some monitoring for a downsampling task with InfluxDB.
2. Using CloudWatch to schedule the runs. This was the more expensive option for this use case, where the runs were periodic and on a consistent data load.
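
The downsampling task these resources refer to boils down to grouping points into fixed time buckets and aggregating each bucket. A minimal sketch of that step, assuming points arrive as (epoch seconds, value) pairs and the aggregate is a mean; the `downsample` function is hypothetical and is not taken from ricks-downsampler.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) pairs sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = (timestamp // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Illustrative data only: five raw points reduced to one-minute means.
raw = [(0, 1.0), (30, 3.0), (60, 10.0), (119, 20.0), (120, 5.0)]
per_minute = downsample(raw, bucket_seconds=60)
# per_minute == [(0, 2.0), (60, 15.0), (120, 5.0)]
```

A scheduled job would typically query the most recent raw interval, apply a step like this, and write the result to a second database or table.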

Try it yourself
https://github.com/InfluxCommunity/ricks-downsampler/

Bytewax
Bytewax is a framework designed for building data processing
pipelines with a focus on streaming and stateful computations.
It is particularly suited for tasks that involve real-time data
processing, such as ETL (extract, transform, load) pipelines,
event-driven architectures, and continuous analytics.
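
To make "stateful computation" concrete, the sketch below keeps per-key state while consuming a stream and emits an updated result for every event. It is plain Python and deliberately does not use the Bytewax API; the `running_mean` generator and the sensor names are hypothetical.

```python
def running_mean(events):
    """Yield (key, running mean) for each (key, value) event.

    State (count and total per key) persists across events, which is what
    distinguishes a stateful step from a stateless map or filter.
    """
    state = {}  # key -> (count, total)
    for key, value in events:
        count, total = state.get(key, (0, 0.0))
        count += 1
        total += value
        state[key] = (count, total)
        yield key, total / count


# Illustrative stream only: readings from two sensors, interleaved.
stream = [("sensor_a", 2.0), ("sensor_b", 10.0), ("sensor_a", 4.0)]
results = list(running_mean(stream))
# results == [("sensor_a", 2.0), ("sensor_b", 10.0), ("sensor_a", 3.0)]
```

A streaming framework adds what this sketch leaves out: partitioning keys across workers, recovering state after failure, and windowing by event time.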

Advantages to Bytewax
•Stateful Computations
•Parallel and Distributed Execution
•Python Integration
•Windowing and Time-Based Operations
•Connectors and Integrations
•Event-Driven Architecture
•InfluxDB Source and Sink Connectors

Kafka and Faust
Kafka and Faust are both tools used for building data pipelines
and stream processing systems. They each have unique
features and advantages that make them suitable for ETL
(Extract, Transform, Load) tasks.

Advantages to Kafka and Faust
Kafka
•High Throughput and Low Latency
•Scalability
•Durability and Reliability
•Fault Tolerance
•Pub/Sub Messaging
Faust
•Pythonic API
•Stream Processing
•Ease of Use and Integration with Kafka

Resources

Join the InfluxDB Community
Sign up for Free
•Influxdata.com/cloud
•Via cloud marketplace
Learn
•Blogs: https://www.influxdata.com/blog/
•Documentation: https://docs.influxdata.com/
•InfluxDB University: https://influxdbu.com/
•Community: https://community.influxdata.com/ and https://influxcommunity.slack.com/

Any Questions?

Get Help + Resources!
Website: https://www.influxdata.com/
Get started with 3.0: https://cloud2.influxdata.com/signup
Forums: community.influxdata.com
Docs: docs.influxdata.com
Blogs: influxdata.com/blog
InfluxDB University: influxdata.com/university

Thank you