Cassandra Data Modelling

knoldus 995 views 45 slides May 03, 2018

Slide 1 of 45

About This Presentation

Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. The slides explain the details of data modeling in Cassandra.

Size: 1.43 MB

Language: en

Added: May 03, 2018

Slides: 45 pages

Slide Content

Cassandra Data
Modelling

Agenda
●Relational Data Model
●Cassandra Table Structure
●Cassandra Keys
●Cassandra Data Modeling
●Materialized Views

“ Relational Databases have been one of the most successful
applications in history. It’s used everywhere from individuals to
small firms, and in giant multinational corporations with clusters
of hundreds of finely tuned instances representing multi-terabyte
data warehouses. Relational databases store invoices, customer
records, product catalogues, accounting ledgers, user
authentication schemes—the very world, it might appear. There
is no question that the relational database is a key facet of the
modern technology and business landscape, and one that will be
with us in its various forms for many years to come “
RDBMS

Relational Data Model

●SQL (Standard) for DDL and DML.
●Normalized form of Data
●Allows user to represent complex relationshipswith the data.
●Supports ACID Transactions
●“Joins” a really powerful way to extract meaningful data.

RDBMS

● RDBMS Data Model

Why NoSql Then ?

●Scalability (ACID for transaction support needs locking).
●Vertical Scalability easy but horizontal scalability a huge
problem.
●Schema Oriented (What if our data structure is not fixed)
●Joins Performance becomes a bottleneck.
●Distributed Design is harder.

RDBMS Problems

Cassandra Data Model

Simplest Data Store
Name 1 Name 2 Name 3
Value 1 Value 2 Value 2

● ● Simplest Data Store
ROW
Column 1 Column 2 Column 3
Value 1 Value 2 Value 2
PRIMARY KEY

ROW 1
Column 1 Column 2 Column 3
Value 1 Value 2 Value 2
PRIMARY KEY
ROW 2
Column 1 Column 3
Value 1 Value 2
PRIMARY KEY
TABLE

Wide Row
PARTITION
KEY
TABLE
STATIC COLUMN
VALUE
Clustering
Key
Column
Value Value
Clustering
Key
Column
Value Value
Wide Row
PARTITION
KEY
STATIC COLUMN
VALUE
Clustering
Key
Column
Value Value
Column
Value
Column
Value

CREATE TABLE raw_weather_data (
weatherId text,
year int,
month int,
day int,
hour int,
temperature double,
PRIMARY KEY ((weatherId), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC,
hour DESC);
Practical Example

INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘1000001’,2005,12,1,7,-5.3);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,8,-4.8);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,9,-5.1);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,10,-5.6);
Practical Example

Clustering Order

●The column, which is a name/value pair
●The row, which is a container for columns referenced by a
primary key
●The table, which is a container for rows
●The keyspace, which is a container for tables
●The cluster, which is a container for keyspaces that spans
one or more nodes
Cassandra Data Structures

CREATE TABLE playlists (
id uuid,
song_order int,
song_id uuid,
title text,
album text,
artist text,
PRIMARY KEY (id, song_order ) );
Cassandra Keys

Cassandra Keys
PRIMARY KEY (id ,
song_order ) ;
Partition Key
Clustering Key
The partition key determines which
node stores the data.It is responsible
for data distribution across the
nodes.
The additional columns determine
per-partition clustering. Clustering is
a storage engine process that sorts
data within the partition.

Domain - Hotel Reservations
The conceptual domain includes :
●Hotels
●Guests
●Rooms
●Room Rates
●Reservations
●Point of Interest
●Geolocation data for hotels and POI
Conceptual Data Modeling

ER Diagram

●No Joins

●No Referential Integrity
●Denormalization
●Query First Design
●Designing for Optimal Storage
●Sorting is a Design Decision
Key Considerations

We do not have Joins in Cassandra, now what to do ?
2 Options -
●Do the work on client side and handle it through code :-)
●Create a new denormalized table which represents the
Join data (Preferred Approach)
No Joins

Cassandra does not enforce referential integrity but, it is
still a common design requirement to store IDs related to
other entities in your tables, but operations such as
cascading deletes are not available
No Referential Integrity

Denormalization is not evil.
Even in RDBMS as the data goes bigger and bigger the Join
performance becomes really slow and we have to
denormalize to reduce the number of joins to achieve a
reasonable query performance.
We may also have use cases where we have to keep it in
denormalized form (e.g. Invoice Data)
Space is cheaper and Cassandra performs best when the
data is in denormalized form.
Denormalization

Cassandra you don’t start with the data model; you start
with the query model.
●Instead of modeling the data first and then writing
queries, with Cassandra you model the queries and let
the data be organized around them.
●Think of the most common query paths your application
will use, and then create the tables that you need to
support them.
Query First Design

We have to think about how we are storing the data in
Cassandra as its tables are each stored as a separate files
on Disk.
●Keep the related columns defined together in same
table. You can also think of keeping frequently used
columns in one table and less used in other table.
●Always remember we have to minimize the number of
partitions read to get the query results.
Designing for Optimal Storage

In RDBMS you do not have a default sort order and you can
do sorting on any column or list of columns by Order By
clause. In Cassandra it is a design decision :
●The sort order available on queries is fixed and is
determined entirely by Clustering Columns defined
during create statement.
●CQL SELECT statement does support ORDER BY
semantics, but only in the order specified by the
clustering columns.
Sorting is a Design Decision

Queries Related to Hotel :
Q1. Find hotels near a given point of interest.
Q2. Find information about a given hotel, such as its
name and location.
Q3. Find points of interest near a given hotel.
Q4. Find an available room in a given date range.
Q5. Find the rate and amenities for a room.
Lets Focus on Application Queries

Queries Related to Reservation :
Q6. Lookup a reservation by confirmation number.
Q7. Lookup a reservation by hotel, date, and guest
name.
Q8. Lookup all reservations by guest name.
Q9. View guest details.
Lets Focus on Application Queries

Application Queries

Approach for Modeling
1.Identify the Primary Entity that will be needed to satisfy
the Query e.g. for Q1 we need to find hotels near POI so
we have to query hotels and POI.
2.Identify the Primary Key based on the Query (so for Q1 it
seems like ID of the POI entity) also identify the
clustering keys to guarantee unique records and sorting.
3.Identify the other attributes that will be be needed to
satisfy our need (additional columns)

Approach for Modeling
Lets focus on Q1 : View Hotels near POI
-Since we need to search hotels near POI there will be
two entities involved Hotel and POI
-Now we have to select hotel for a POI we will surely need
to add POI Name as a primary Key.
-For a POI there can be multiple hotels , so we need
HotelID as well. Hence we need it a part of our key as a
clustering column.
-We need to show Hotels to the user so Name, Address,
Phone are additional attributes that we will want to fetch
as a part of Q1.

First table
Table for Q1
hotels_by_poi
poi_name K
hotel_id C
name
phone
address

Hotel Logical Data Model

Reservation Logical Data Model

Hotel Physical Data Model

Reservation Physical Data Model

Materialized Views
In Cassandra you can not add a where clause for a column
that is not part of the key, here we can either create
secondary indexes or materialized views.
Materialized views are preconfigured views that support
queries on additional columns which are not part of the
original clustering key.
They simplify application development: instead of the
application having to keep multiple denormalized tables in
sync, Cassandra takes on the responsibility of updating
views in order to keep them consistent with the base table.

Important Aspects
Partition Size
- Number of Columns (2 billion columns)
Formula : N
v
= N
r
( N
c
− N
pk
− N
s
) + N
s
The number of values (or cells) in the partition (N
v
) is equal
to the number of static columns (N
s
) plus the product of the
number of rows (N
r
) and the number of of values per row.
The number of values per row is defined as the number of
columns (N
c
) minus the number of primary key columns (N
pk
)
and static columns (N
s
).

References
1. Cassandra: The Definitive Guide, 2nd Edition
Distributed Data at Web Scale
By Eben Hewitt, Jeff Carpenter
2. Introduction to Data Modeling with Apache Cassandra
Patrick McFadin
Chief Evangelist for Apache Cassandra

Cassandra Data Modelling

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Cassandra Data Modelling

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 25

Slide 26

Slide 27

Slide 28

Slide 29

Slide 30

Slide 31

Slide 32

Slide 33

Slide 34

Slide 35

Slide 36

Slide 37

Slide 38

Slide 39

Slide 40

Slide 41

Slide 42

Slide 43

Slide 44

Slide 45

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

8-top-ai-courses-for-customer-support-representatives-in-2025.pptx

7-essential-ai-courses-for-call-center-supervisors-in-2025.pptx

25-essential-ai-courses-for-user-support-specialists-in-2025.pptx

8-essential-ai-courses-for-insurance-customer-service-representatives-in-2025.pptx

Know for Certain

PPT OPD LES 3ertt4t4tqqqe23e3e3rq2qq232.pptx