Cassandra Data Modelling

knoldus 995 views 45 slides May 03, 2018
Slide 1
Slide 1 of 45
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45

About This Presentation

Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. The slides explain the details of data modeling in Cassandra.


Slide Content

Cassandra Data
Modelling

Agenda
●Relational Data Model
●Cassandra Table Structure
●Cassandra Keys
●Cassandra Data Modeling
●Materialized Views

“ Relational Databases have been one of the most successful
applications in history. It’s used everywhere from individuals to
small firms, and in giant multinational corporations with clusters
of hundreds of finely tuned instances representing multi-terabyte
data warehouses. Relational databases store invoices, customer
records, product catalogues, accounting ledgers, user
authentication schemes—the very world, it might appear. There
is no question that the relational database is a key facet of the
modern technology and business landscape, and one that will be
with us in its various forms for many years to come “
RDBMS

Relational Data Model

●SQL (Standard) for DDL and DML.
●Normalized form of Data
●Allows user to represent complex relationshipswith the data.
●Supports ACID Transactions
●“Joins” a really powerful way to extract meaningful data.

RDBMS

● RDBMS Data Model

Why NoSql Then ?

●Scalability (ACID for transaction support needs locking).
●Vertical Scalability easy but horizontal scalability a huge
problem.
●Schema Oriented (What if our data structure is not fixed)
●Joins Performance becomes a bottleneck.
●Distributed Design is harder.

RDBMS Problems

Cassandra Data Model

Cassandra Data Model

Simplest Data Store
Name 1 Name 2 Name 3
Value 1 Value 2 Value 2

● ● Simplest Data Store
ROW
Column 1 Column 2 Column 3
Value 1 Value 2 Value 2
PRIMARY KEY

ROW 1
Column 1 Column 2 Column 3
Value 1 Value 2 Value 2
PRIMARY KEY
ROW 2
Column 1 Column 3
Value 1 Value 2
PRIMARY KEY
TABLE

Wide Row
PARTITION
KEY
TABLE
STATIC COLUMN
VALUE
Clustering
Key
Column
Value Value
Clustering
Key
Column
Value Value
Wide Row
PARTITION
KEY
STATIC COLUMN
VALUE
Clustering
Key
Column
Value Value
Column
Value
Column
Value

CREATE TABLE raw_weather_data (
weatherId text,
year int,
month int,
day int,
hour int,
temperature double,
PRIMARY KEY ((weatherId), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC,
hour DESC);
Practical Example

INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘1000001’,2005,12,1,7,-5.3);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,8,-4.8);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,9,-5.1);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,10,-5.6);
Practical Example

Clustering Order

●The column, which is a name/value pair
●The row, which is a container for columns referenced by a
primary key
●The table, which is a container for rows
●The keyspace, which is a container for tables
●The cluster, which is a container for keyspaces that spans
one or more nodes
Cassandra Data Structures

CREATE TABLE playlists (
id uuid,
song_order int,
song_id uuid,
title text,
album text,
artist text,
PRIMARY KEY (id, song_order ) );
Cassandra Keys

Cassandra Keys
PRIMARY KEY (id ,
song_order ) ;
Partition Key
Clustering Key
The partition key determines which
node stores the data.It is responsible
for data distribution across the
nodes.
The additional columns determine
per-partition clustering. Clustering is
a storage engine process that sorts
data within the partition.

Domain - Hotel Reservations
The conceptual domain includes :
●Hotels
●Guests
●Rooms
●Room Rates
●Reservations
●Point of Interest
●Geolocation data for hotels and POI
Conceptual Data Modeling

ER Diagram

●No Joins

●No Referential Integrity
●Denormalization
●Query First Design
●Designing for Optimal Storage
●Sorting is a Design Decision
Key Considerations

We do not have Joins in Cassandra, now what to do ?
2 Options -
●Do the work on client side and handle it through code :-)
●Create a new denormalized table which represents the
Join data (Preferred Approach)
No Joins

Cassandra does not enforce referential integrity but, it is
still a common design requirement to store IDs related to
other entities in your tables, but operations such as
cascading deletes are not available
No Referential Integrity

Denormalization is not evil.
Even in RDBMS as the data goes bigger and bigger the Join
performance becomes really slow and we have to
denormalize to reduce the number of joins to achieve a
reasonable query performance.
We may also have use cases where we have to keep it in
denormalized form (e.g. Invoice Data)
Space is cheaper and Cassandra performs best when the
data is in denormalized form.
Denormalization

Cassandra you don’t start with the data model; you start
with the query model.
●Instead of modeling the data first and then writing
queries, with Cassandra you model the queries and let
the data be organized around them.
●Think of the most common query paths your application
will use, and then create the tables that you need to
support them.
Query First Design

We have to think about how we are storing the data in
Cassandra as its tables are each stored as a separate files
on Disk.
●Keep the related columns defined together in same
table. You can also think of keeping frequently used
columns in one table and less used in other table.
●Always remember we have to minimize the number of
partitions read to get the query results.
Designing for Optimal Storage

In RDBMS you do not have a default sort order and you can
do sorting on any column or list of columns by Order By
clause. In Cassandra it is a design decision :
●The sort order available on queries is fixed and is
determined entirely by Clustering Columns defined
during create statement.
●CQL SELECT statement does support ORDER BY
semantics, but only in the order specified by the
clustering columns.
Sorting is a Design Decision

Queries Related to Hotel :
Q1. Find hotels near a given point of interest.
Q2. Find information about a given hotel, such as its
name and location.
Q3. Find points of interest near a given hotel.
Q4. Find an available room in a given date range.
Q5. Find the rate and amenities for a room.
Lets Focus on Application Queries

Queries Related to Reservation :
Q6. Lookup a reservation by confirmation number.
Q7. Lookup a reservation by hotel, date, and guest
name.
Q8. Lookup all reservations by guest name.
Q9. View guest details.
Lets Focus on Application Queries

Application Queries

Approach for Modeling
1.Identify the Primary Entity that will be needed to satisfy
the Query e.g. for Q1 we need to find hotels near POI so
we have to query hotels and POI.
2.Identify the Primary Key based on the Query (so for Q1 it
seems like ID of the POI entity) also identify the
clustering keys to guarantee unique records and sorting.
3.Identify the other attributes that will be be needed to
satisfy our need (additional columns)

Approach for Modeling
Lets focus on Q1 : View Hotels near POI
-Since we need to search hotels near POI there will be
two entities involved Hotel and POI
-Now we have to select hotel for a POI we will surely need
to add POI Name as a primary Key.
-For a POI there can be multiple hotels , so we need
HotelID as well. Hence we need it a part of our key as a
clustering column.
-We need to show Hotels to the user so Name, Address,
Phone are additional attributes that we will want to fetch
as a part of Q1.

First table
Table for Q1
hotels_by_poi
poi_name K
hotel_id C
name
phone
address

Hotel Logical Data Model

Reservation Logical Data Model

Hotel Physical Data Model

Reservation Physical Data Model

Materialized Views
In Cassandra you can not add a where clause for a column
that is not part of the key, here we can either create
secondary indexes or materialized views.
Materialized views are preconfigured views that support
queries on additional columns which are not part of the
original clustering key.
They simplify application development: instead of the
application having to keep multiple denormalized tables in
sync, Cassandra takes on the responsibility of updating
views in order to keep them consistent with the base table.

Important Aspects
Partition Size
- Number of Columns (2 billion columns)
Formula : N
v
= N
r
( N
c
− N
pk
− N
s
) + N
s
The number of values (or cells) in the partition (N
v
) is equal
to the number of static columns (N
s
) plus the product of the
number of rows (N
r
) and the number of of values per row.
The number of values per row is defined as the number of
columns (N
c
) minus the number of primary key columns (N
pk
)
and static columns (N
s
).

References
1. Cassandra: The Definitive Guide, 2nd Edition
Distributed Data at Web Scale
By Eben Hewitt, Jeff Carpenter
2. Introduction to Data Modeling with Apache Cassandra
Patrick McFadin
Chief Evangelist for Apache Cassandra

Thank You