Essential concepts of data architectures

mbrambil 28 views 20 slides Sep 24, 2024

About This Presentation

The basic concepts that are needed to understand relational and non-relational database architectures


Slide Content

SYSTEMS AND METHODS FOR BIG AND UNSTRUCTURED DATA
Data Architecture Concepts
Marco Brambilla
[email protected]
@marcobrambi

Schema

The data schema
•Typing
•Coherence
•Uniformity

Transactions
•The relational world
•Multi-user
•Distributed systems

Definition of Transaction
•An elementary unit of work performed by an application
•Each transaction is encapsulated within two commands:
•begin transaction (bot) and end transaction (eot)
•Within a transaction, one of the commands below is executed exactly once:
•commit work (commit) and rollback work (abort)
•Transactional system (OLTP, online transaction processing): a system capable of providing the definition and execution of transactions on behalf of multiple, concurrent applications
•As opposed to OLAP (online analytical processing)

Application and Transaction
[Figure: an application program whose actions are grouped into two transactions, T1 (delimited by begin T1 and end T1) and T2 (delimited by begin T2 and end T2).]
Transaction: Example
begin transaction;
-- credit 10 to account 12202
update Account
  set Balance = Balance + 10 where AccNum = 12202;
-- debit 10 from account 42177
update Account
  set Balance = Balance - 10 where AccNum = 42177;
commit work;
end transaction;

Transaction: Example with Alternative
begin transaction;
update Account
  set Balance = Balance + 10 where AccNum = 12202;
update Account
  set Balance = Balance - 10 where AccNum = 42177;
-- read the debited account's balance to decide the outcome
select Balance into A from Account
  where AccNum = 42177;
-- commit only if the debited account is not overdrawn
if (A >= 0) then commit work
else rollback work;
end transaction;
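
The commit-or-rollback pattern above can be tried out with a short Python sketch using the standard-library sqlite3 module. This is only an illustration under assumptions: the in-memory database, the starting balances, and the helper names `make_demo_db` and `transfer` are hypothetical and not part of the slides.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory Account table with two hypothetical accounts."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # transaction boundaries are exactly the BEGIN/COMMIT/ROLLBACK we issue.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE Account (AccNum INTEGER PRIMARY KEY, Balance INTEGER)")
    conn.executemany("INSERT INTO Account VALUES (?, ?)", [(12202, 100), (42177, 5)])
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst`; roll back if `src` would go negative."""
    cur = conn.cursor()
    cur.execute("BEGIN")  # begin transaction
    cur.execute("UPDATE Account SET Balance = Balance + ? WHERE AccNum = ?", (amount, dst))
    cur.execute("UPDATE Account SET Balance = Balance - ? WHERE AccNum = ?", (amount, src))
    (balance,) = cur.execute(
        "SELECT Balance FROM Account WHERE AccNum = ?", (src,)
    ).fetchone()
    if balance >= 0:
        cur.execute("COMMIT")  # commit work: both updates become durable
        return True
    cur.execute("ROLLBACK")  # rollback work: both updates are undone
    return False
```

Calling `transfer(conn, 42177, 12202, 10)` on the demo database would overdraw account 42177, so both updates are undone and the balances stay as they were.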

Well-formed Transactions
•begin transaction
•code for data manipulation (reads and writes)
•commit work or rollback work
•no further data manipulation
•end transaction

Partitioning and Replication

Horizontal vs. Vertical Partitioning
•The distinction of horizontal vs. vertical comes from the traditional tabular view of a database.

Data Partitioning
Aim: scalability, distribution.
Partitioning splits the data in the database and distributes the pieces to different storage nodes.
Databases can be partitioned horizontally (by rows) or vertically (by columns).
See also: sharding (as horizontal partitioning).
[Figure: a data set made of parts A, B and C, and the result of a SPLIT into A, B and C.]
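
As a minimal sketch of the two partitioning directions, the Python functions below split a small table held as a list of dicts. The function names, the CRC32-based node assignment, and the rule that every vertical fragment keeps the key column are assumptions made for illustration; real systems use a variety of partitioning schemes.

```python
from zlib import crc32


def horizontal_partition(rows, key, n_nodes):
    """Split by rows: each row goes to the node chosen by hashing its key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 is used because it is stable across runs, unlike hash() on str.
        nodes[crc32(str(row[key]).encode()) % n_nodes].append(row)
    return nodes


def vertical_partition(rows, key, column_groups):
    """Split by columns: one fragment per column group, each keeping the key."""
    return [
        [{col: row[col] for col in [key, *group]} for row in rows]
        for group in column_groups
    ]
```

Keeping the key in every vertical fragment is what allows the original rows to be reassembled by joining the fragments on that key.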

Replication
Aim: fault-tolerance, backup
Replication copies the entire database across all nodes in the
distributed system.
[Figure: COPY of the whole database onto the nodes.]
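
The fault-tolerance aim can be illustrated with a toy in-memory store in Python. The class `ReplicatedStore` and its methods are hypothetical names; the sketch assumes synchronous writes to every live node and ignores the recovery and consistency issues a real system has to handle.

```python
class ReplicatedStore:
    """Toy key-value store that keeps a full copy of the data on every node."""

    def __init__(self, n_nodes):
        self.nodes = [{} for _ in range(n_nodes)]
        self.alive = [True] * n_nodes

    def put(self, key, value):
        # Every write is copied to all live nodes (the network/memory cost).
        for node, up in zip(self.nodes, self.alive):
            if up:
                node[key] = value

    def get(self, key):
        # Any live node can serve the read (the reliability benefit).
        for node, up in zip(self.nodes, self.alive):
            if up:
                return node[key]
        raise RuntimeError("no replica available")

    def fail(self, index):
        """Simulate the loss of one node."""
        self.alive[index] = False
```

With three nodes, a value written before two of them fail can still be read from the remaining one.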

Partitioning + Replication
Possible?
[Figure: COPY and SPLIT applied together to a data set with parts A, B and C.]
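
One way to see that the two techniques can be combined is to partition the data into shards and then store each shard on more than one node. The Python sketch below computes such a placement; the function name `place_replicas` and the round-robin rule are assumptions chosen for illustration, not a scheme taken from the slides.

```python
def place_replicas(n_shards, n_nodes, replication_factor):
    """Map each shard to the list of nodes that hold a copy of it."""
    if replication_factor > n_nodes:
        raise ValueError("cannot place more copies than there are nodes")
    # Shard s goes to node s (mod n_nodes) and to the nodes that follow it.
    return {
        shard: [(shard + r) % n_nodes for r in range(replication_factor)]
        for shard in range(n_shards)
    }
```

For three shards (A, B, C as 0, 1, 2) on three nodes with two copies each, every node stores two shards and the loss of any single node still leaves one copy of each shard.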

Pros / Cons of Each
Partitioning
•Pros: fast data writing / reading. Low memory overhead.
•Cons: potential data loss.
Replication
•Pros: fast data reading. High data reliability.
•Cons: high network overhead. High memory overhead.

Scale and Ingestion

Scalability
•How big is big?
•Not only scaling up
•Elasticity

Data Ingestion
•The process of importing, transferring and loading data for storage and later use
•It involves loading data from a variety of sources
•It can involve altering individual files to fit a format that optimizes storage
•For instance, in Big Data small files are concatenated to form files of 100s of MB, and large files are broken down into files of 100s of MB.
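
The file-size normalization described in the last bullet can be sketched in Python as a planning step that works on file sizes only. The function name `plan_chunks`, the single `target` size, and the choice to let a large file's pieces share a chunk with smaller files are all assumptions for illustration.

```python
def plan_chunks(file_sizes, target):
    """Group (name, size) pairs into chunks of at most `target` bytes.

    Small files are concatenated into one chunk; a file larger than the
    remaining room is split into pieces across consecutive chunks.
    Returns a list of chunks, each a list of (name, n_bytes) pieces.
    """
    chunks, current, room = [], [], target
    for name, size in file_sizes:
        remaining = size
        while remaining > 0:
            take = min(remaining, room)
            current.append((name, take))
            remaining -= take
            room -= take
            if room == 0:  # chunk is full: close it and start a new one
                chunks.append(current)
                current, room = [], target
    if current:  # last, partially filled chunk
        chunks.append(current)
    return chunks
```

In practice `target` would be in the range of hundreds of MB, as the slide suggests; small numbers are enough to check the logic.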
[Figure: Data source #1, Data source #2, […], Data source #n connected through a bus.]
Integration services: transportation management, message routing, message brokering, transaction management, security, and transformation.

Data Wrangling
The process of cleansing "raw" data and transforming it into data that can be analysed to generate valid, actionable insights.
It includes understanding, cleansing, augmenting and shaping data.
The result is data in the best format (e.g., columnar) for the analysis to be performed.
Understand
•What types of data are there: qualitative, quantitative, categorical?
•Are there any outliers?
•Is any data missing?
Cleanse
•Remove outliers
•Add missing values
•Transform quantitative values into categorical ones
Augment
•Aggregate data within one data source
•Add data from other data sources based on exact/fuzzy matches
Shape
•Give data the best format for the analysis to be performed
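
The Understand, Cleanse and Shape steps above can be sketched on a single numeric field with the Python standard library. Everything specific here is an assumption for illustration: the function name `wrangle`, treating values outside a given range as outliers, filling missing values with the median, and the two-level "low"/"high" category. The Augment step is left out to keep the sketch short.

```python
from statistics import median


def wrangle(records, field, low, high, threshold):
    """Cleanse one numeric field and return the data in columnar form."""
    # Understand: which values are present and which lie inside [low, high]?
    present = [r[field] for r in records if r[field] is not None]
    inliers = [v for v in present if low <= v <= high]
    fill = median(inliers)  # value used for missing entries

    cleaned = []
    for r in records:
        value = r[field]
        if value is not None and not (low <= value <= high):
            continue  # Cleanse: remove outliers
        if value is None:
            value = fill  # Cleanse: add missing values
        # Cleanse: turn the quantitative value into a categorical one
        level = "high" if value >= threshold else "low"
        cleaned.append({**r, field: value, field + "_level": level})

    # Shape: one list per column instead of one dict per row
    if not cleaned:
        return {}
    return {col: [row[col] for row in cleaned] for col in cleaned[0]}
```

A columnar result like this is convenient for analyses that scan one attribute at a time, which is the kind of format the slide mentions as an example.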
