The basic concepts that are needed to understand relational and non-relational database architectures
Size: 884.56 KB
Language: en
Added: Sep 24, 2024
Slides: 20 pages
Slide Content
SYSTEMS AND METHODS FOR BIG AND UNSTRUCTURED DATA
Data Architecture Concepts
Marco Brambilla
@marcobrambi
Schema
The data schema
•Typing
•Coherence
•Uniformity
Transactions
•The relational world
•Multi-user
•Distributed systems
Definition of Transaction
•An elementary unit of work performed by an application
•Each transaction is encapsulated within two commands:
•begin transaction (bot) and end transaction (eot)
•Within a transaction, exactly one of the commands below is executed:
•commit work (commit) or rollback work (abort)
•Transactional system (OLTP): a system capable of providing the definition and execution of transactions on behalf of multiple, concurrent applications
•As opposed to OLAP
Application and Transaction
[Figure: an application program containing two transactions, T1 (begin T1 ... end T1) and T2 (begin T2 ... end T2), each enclosing a sequence of actions]
Transaction: Example
begin transaction;
update Account
set Balance = Balance + 10 where AccNum = 12202;
update Account
set Balance = Balance - 10 where AccNum = 42177;
commit work;
end transaction;
Transaction: Example with Alternative
begin transaction;
update Account
set Balance = Balance + 10 where AccNum = 12202;
update Account
set Balance = Balance - 10 where AccNum = 42177;
select Balance into A from Account
where AccNum = 42177;
if (A >= 0) then commit work
else rollback work;
end transaction;
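The transfer above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not the slide author's code: the function names `transfer` and `make_demo_db`, the table layout, and the starting balances are assumptions made for illustration.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory Account table (starting balances are assumed)."""
    # isolation_level=None lets us issue BEGIN / COMMIT / ROLLBACK ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE Account (AccNum INTEGER PRIMARY KEY, Balance INTEGER)")
    conn.execute("INSERT INTO Account VALUES (12202, 100)")
    conn.execute("INSERT INTO Account VALUES (42177, 5)")
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst`; roll back if `src` would go negative."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    cur.execute("UPDATE Account SET Balance = Balance + ? WHERE AccNum = ?", (amount, dst))
    cur.execute("UPDATE Account SET Balance = Balance - ? WHERE AccNum = ?", (amount, src))
    (balance,) = cur.execute(
        "SELECT Balance FROM Account WHERE AccNum = ?", (src,)
    ).fetchone()
    if balance >= 0:
        cur.execute("COMMIT")    # both updates become permanent together
        return True
    cur.execute("ROLLBACK")      # both updates are undone together
    return False
```

With the assumed balances, `transfer(conn, 42177, 12202, 10)` would overdraw account 42177, so both updates are rolled back and the table is left unchanged.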
Well-formed Transactions
•begin transaction
•code for data manipulation (reads and writes)
•commit work or rollback work
•no further data manipulation
•end transaction
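The well-formed pattern can be expressed as a small check over a list of command names. This is only an illustrative sketch: the function `is_well_formed` and the command strings "read" and "write" are hypothetical stand-ins for real data manipulation statements.

```python
def is_well_formed(commands):
    """Check: begin, reads/writes, exactly one commit or rollback, then end."""
    if len(commands) < 3:
        return False
    if commands[0] != "begin transaction" or commands[-1] != "end transaction":
        return False
    body = commands[1:-1]
    outcomes = [i for i, c in enumerate(body) if c in ("commit work", "rollback work")]
    # Exactly one commit/rollback is allowed inside the transaction.
    if len(outcomes) != 1:
        return False
    # No data manipulation may follow the commit/rollback.
    if outcomes[0] != len(body) - 1:
        return False
    # Everything before the outcome must be a read or a write.
    return all(c in ("read", "write") for c in body[:-1])
```

For example, a sequence that writes after `commit work` is rejected, as is one that reaches `end transaction` without committing or rolling back.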
Partitioning and Replication
Horizontal vs. Vertical Partitioning
•The distinction of horizontal vs. vertical comes from the traditional tabular view of a database.
Data Partitioning
Aim: scalability, distribution.
Partitioning splits the data in the database and distributes pieces of it to different storage nodes.
Databases can be partitioned horizontally (by rows) or vertically (by columns).
See also: Sharding (as horizontal partitioning).
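A minimal sketch of the two kinds of partitioning on a table held as a list of rows. The function names, the modulo placement rule, and the idea of repeating the key column in every vertical fragment are assumptions for illustration; real systems use more elaborate schemes.

```python
def horizontal_partition(rows, key, n_nodes):
    """Split by rows: each row goes to the node chosen by key modulo n_nodes."""
    shards = [[] for _ in range(n_nodes)]
    for row in rows:
        shards[row[key] % n_nodes].append(row)
    return shards


def vertical_partition(rows, key, column_groups):
    """Split by columns: each fragment keeps the key plus one group of columns."""
    return [
        [{col: row[col] for col in (key, *group)} for row in rows]
        for group in column_groups
    ]


# Hypothetical sample table.
people = [
    {"id": 1, "name": "Ann", "city": "Milan"},
    {"id": 2, "name": "Bob", "city": "Rome"},
    {"id": 3, "name": "Cleo", "city": "Turin"},
    {"id": 4, "name": "Dan", "city": "Naples"},
]
```

Horizontal fragments hold complete rows for a subset of keys, while vertical fragments hold every key but only some of the columns.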
[Figure: SPLIT, a dataset with parts A, B, and C is split into separate parts]
Replication
Aim: fault-tolerance, backup
Replication copies the entire database across all nodes in the
distributed system.
[Figure: COPY]
Partitioning + Replication
Possible?
[Figure: COPY and SPLIT applied together to a dataset with parts A, B, and C]
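One common way to combine the two is to partition the data first and then keep several copies of each partition on different nodes. The sketch below assumes a simple round-robin placement; the function name `place_shards` and the `replication_factor` parameter are hypothetical.

```python
def place_shards(shard_names, n_nodes, replication_factor):
    """Assign each shard to `replication_factor` distinct nodes, round-robin."""
    if replication_factor > n_nodes:
        raise ValueError("cannot place more copies than there are nodes")
    placement = {node: [] for node in range(n_nodes)}
    for i, shard in enumerate(shard_names):
        for r in range(replication_factor):
            # Copy r of shard i lands r nodes after the shard's home node.
            placement[(i + r) % n_nodes].append(shard)
    return placement


placement = place_shards(["A", "B", "C"], n_nodes=3, replication_factor=2)
```

With three nodes and two copies per shard, each node stores only two of the three parts, yet every part survives the loss of any single node.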
Pros / Cons of Each
Partitioning
Pros: Fast data writing / reading. Low memory overhead.
Cons: Potential data loss.
Replication
Pros: Fast data reading. High data reliability.
Cons: High network overhead. High memory overhead.
Scale and Ingestion
Scalability
•How big is big?
•Not only scaling up
•Elasticity
Data Ingestion
•The process of importing, transferring and loading data for storage and later use
•It involves loading data from a variety of sources
•It can involve altering and modifying individual files to fit a format that optimizes storage
•For instance, in Big Data small files are concatenated to form files of 100s of MBs and large files are broken down into files of 100s of MBs.
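The concatenate-and-split step can be sketched on in-memory byte strings. The function name `rechunk` and the `target_size` parameter are assumptions, and the sketch ignores record boundaries, which a real ingestion pipeline would need to respect.

```python
def rechunk(blobs, target_size):
    """Merge small blobs and split large ones into chunks of `target_size` bytes.

    Only the last chunk may be shorter than `target_size`.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    chunks = []
    buffer = bytearray()
    for blob in blobs:
        buffer.extend(blob)
        # Emit full-size chunks as soon as enough bytes have accumulated.
        while len(buffer) >= target_size:
            chunks.append(bytes(buffer[:target_size]))
            del buffer[:target_size]
    if buffer:
        chunks.append(bytes(buffer))
    return chunks
```

In practice `target_size` would be in the range of 100s of MBs mentioned above; small values just make the behaviour easy to inspect.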
[Figure: data sources #1, #2, […], #n connected to a bus]
Integration services
Transportation management, message
routing, message brokering,
transaction management, security, and
transformation.
Data Wrangling
The process of cleansing "raw" data and transforming it into data that can be analysed to generate valid actionable insights.
It includes understanding, cleansing, augmenting and shaping data.
The result is data in the best format (e.g., columnar) for the analysis to be performed.
Understand
•What types of data are there: qualitative, quantitative, categorical?
•Are there any outliers?
•Is any data missing?
Cleanse
•Remove outliers
•Add missing values
•Transform quantitative values into categorical ones
Augment
•Aggregate data within one data source
•Add data from other data sources based on exact/fuzzy matches
Shape
•Give data the best format for the analysis to be performed
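The four wrangling steps can be sketched on a tiny set of records using only the standard library. Everything specific here is an assumption for illustration: the function name `wrangle`, the field names, the median-based outlier rule with `max_dev`, and the threshold of 25 used to derive a category.

```python
from statistics import median


def wrangle(records, city_region, max_dev=20.0):
    """Cleanse, augment, and shape temperature records into columnar form."""
    # Understand: look at the values that are present.
    known = [r["temp"] for r in records if r["temp"] is not None]
    med = median(known)

    shaped = {"city": [], "region": [], "temp": [], "level": []}
    for r in records:
        # Cleanse: fill a missing value with the median.
        temp = med if r["temp"] is None else r["temp"]
        # Cleanse: drop values too far from the median (assumed outlier rule).
        if abs(temp - med) > max_dev:
            continue
        shaped["city"].append(r["city"])
        # Augment: add data from another source via an exact match on city.
        shaped["region"].append(city_region.get(r["city"]))
        shaped["temp"].append(temp)
        # Cleanse: turn the quantitative value into a categorical one.
        shaped["level"].append("high" if temp >= 25 else "low")
    # Shape: one list per column, i.e. a columnar layout.
    return shaped


readings = [
    {"city": "Milan", "temp": 22.0},
    {"city": "Rome", "temp": None},
    {"city": "Turin", "temp": 500.0},
    {"city": "Naples", "temp": 28.0},
]
regions = {"Milan": "Lombardy", "Rome": "Lazio"}
table = wrangle(readings, regions)
```

Here the Turin reading is dropped as an outlier, the missing Rome value is filled with the median, and Naples has no match in the second source, so its region is left empty.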