Aggregations - The Elasticsearch "GROUP BY"

JohnSobanski 41 views 24 slides May 15, 2024
Slide 1
Slide 1 of 24
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24

About This Presentation

Presented at Elastic's worldwide "Virtual Meetup" on 5/15/2024

https://www.youtube.com/watch?v=ayQap5pH_0w

https://john.soban.ski/aggregations-the-elasticsearch-group-by.html


Slide Content

Aggregations The Elasticsearch GROUP BY John Sobanski – VP of AI/ML

About Me John Sobanski VP AI/ML at Pyramid Systems Inc. Elastic Certified Engineer (2020, 2022) https://soban.ski

Agenda Background - Relational Database Management Systems Aggregations – The Elasticsearch GROUP BY Why use the Elasticsearch Aggregation API 1 st Demo - Execute GROUP BY operations via Elasticsearch aggregations Elasticsearch Aggregations Drive Time Series Data Visualization BUCKETS in Elasticsearch 2 nd Demo - Time Series Data Visualization with Kibana

Relational Database Management System

Series Operations 2 Sum 57 5 Product 40000 20 Min 2 20 Max 20 10 Median 10 Mean 11.4

RDBMS Series Operations In the traditional Relational Database Management System (RDBMS) world, SQL databases use GROUP BY syntax to group rows with similar values into summary rows. The query, "find the number of web page hits per country," for example, represents a typical GROUP BY operation. This table records the number of hits to my site [ https://soban.ski] , broken down by time zone.

Further summarize the table to record "hits per country" via a GROUP BY operation Collapse the Time zones into parent countries. SELECT COUNTRY, SUM (HITS) FROM timezone_hits GROUP BY COUNTRY; SQL Syntax:

RDBMS vs. Elasticsearch Even though Elasticsearch does not use the row construct to identify a unit of data (Elastic calls their rows Documents), we can still perform GROUP BY queries in Elasticsearch Elasticsearch names their  GROUP BY  queries  Aggregations . The Elasticsearch API provides an expressive REST API to execute  Aggregations Kibana also provides a Graphical User Interface (GUI) to execute  Aggregations ElasticSearch Relational Database Index Type Documents Fields Database Table Rows Columns

Aggregations – The Elasticsearch GROUP BY

Why use the Elasticsearch Aggregation API? - 1 Easy Approach: Query Elasticsearch, convert the returned JSON to a Pandas Dataframe , and then apply a Pandas GROUP BY to the Dataframe to retrieve summary stats. Modern laptops include 32GB of memory and you have had no issues with this method. If you use Elasticsearch for non time series data, e.g. static data for blogs, you may not need to worry about running out of memory.

Why use the Elasticsearch Aggregation API? - 2 In the future, you may deal with Big Data. If you collect time series data, such as access logs, or security logs, you might scale to Big Data. In that case, the Elasticsearch database size will exceed the memory of your laptop.

Why use the Elasticsearch Aggregation API? - 3 Aggs  API allows you to command Elasticsearch to execute the GROUP BY analogue in- stu  (a best practice), and then also apply the summary stats in place. Elasticsearch will then return the summary stats as JSON, and you will not run out of memory.

Time Series GROUP BY(RDMMS) and Aggregation(Elasticsearch) operations lend themselves well to Time Series data, since these operations allow you to GROUP BY or Aggregate results over a given time bucket (e.g. Hour, Day, Week, Month, etc.).

1 st Demo Execute GROUP BY operations via Elasticsearch aggregations Demonstrate how to generate tables via both Kibana and the Elasticsearch API.

Elasticsearch Aggregations Drive Time Series Data Visualization

Elasticsearch BUCKETS In Elasticsearch parlance, we put the rows into BUCKETS, with one BUCKET for each country. Consider a query for total hits. The Elasticsearch API reports that I received ~100k hits in the month of June. Add the hits to BUCKETS !

“Country” Buckets Use the AGGS API to count the hits by country.

“Day” Buckets Use the API to count the hits by day.

Time Buckets T Time BUCKETS Analyze, summarize and visualize time series data. BUCKETS return a smaller amount of data, for example, hits per hour We need BUCKETS because we cannot plot every datum from a big data document store

Why Use Time Buckets?

Why Use Time Buckets?

Demo – Time Series Data Viz with Kibana

Conclusion Summarize Series Sum, Median, etc. RDBMS Uses GROUP BY Elasticsearch Uses AGGS Separate data by tag via BUCKETS Country Buckets, Time Buckets, Month Buckets

Thank You