Objectives
Define key terms in the distributed database area:
Distributed vs. Decentralized Database
Homogeneous vs. Heterogeneous Distributed Database
Location transparency vs. local autonomy
Asynchronous vs. Synchronous distributed databases
Horizontal vs. Vertical partitioning
Full refresh vs. differential refresh
Push replication vs. Pull replication
Local transaction vs. Global Transaction
Objectives
Describe salient characteristics of distributed database
environments
Explain advantages and risks of distributed databases
Explain strategies and options for distributed database
design
Discuss synchronous and asynchronous data replication
and partitioning
Discuss optimized query processing in distributed
databases
Distributed vs. Decentralized Database
Both are stored on computers in multiple locations
Distributed Database
Geographical distribution of a SINGLE
database
Decentralized Database
A collection of independent databases on non-
networked computers
Users at various sites cannot share data
Distributed Database
Requires multiple DBMSs running at
remote sites
Distributed database environments
differ in:
The degree to which these DBMSs cooperate
Whether a master site coordinates requests
involving data from multiple sites
Reasons for Distributed Database
Distribution and Autonomy of Business Units
Departments/Facilities are geographically distributed
Each has the authority to create and control own data
Business mergers create this environment
Data sharing
Consolidate data across local databases on demand.
Data communication costs and reliability
Economical and reliable to locate data where needed.
High cost for remote transactions / large data volumes
Dependence on data communications can be risky
Reasons for Distributed Database
Multiple application vendor environment
Each unit may have different vendor applications
A distributed DBMS can provide functionality that
cuts across separate applications
Database recovery
Replicating data on separate computers may ensure
that a damaged database can be quickly recovered
Homogeneous vs. Heterogeneous
Distributed Database
Homogeneous Distributed Database
The same DBMS is used at each node
Difficult for most organizations to force a
homogeneous environment
Heterogeneous Distributed Database
Potentially different DBMSs are used at each
node
Much more difficult to manage
Typical Homogeneous Environment
Data distributed across all the nodes.
Same DBMS at each node.
A central DBMS coordinates database access
and update across the nodes
No exclusively local data
All access is through one, global schema.
The global schema is the union of all the local
schemas.
Figure 13-2: Homogeneous Database (identical DBMSs at every node; everyone is a global user)
Typical Heterogeneous Environment
Data distributed across all the nodes.
Different DBMSs may be used at each node.
Local access is done using the local DBMS
and schema.
Remote access is done using the global
schema.
Figure 13-3: Typical Heterogeneous Environment (non-identical DBMSs; a local user accesses their own local data)
Major Objectives of Distributed Database
Allow users to share data yet be able to operate
independently when a network link fails.
Location Transparency
User does not have to know the location of the data
Data requests automatically forwarded to appropriate
sites
Local Autonomy
Local site can operate with its database when network
connections fail
Each site controls its own data, security, logging,
recovery
Trade-Offs in Distributed Database
When do you update data across the database?
Synchronous Distributed Database
All copies of the same data are always identical
Updates apply immediately to all copies throughout the network
Good for data integrity
High overhead, hence slow response times
Asynchronous Distributed Database
Some data inconsistency is tolerated
Data update propagation is delayed
Lower data integrity
Less overhead, hence faster response times
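The trade-off above can be sketched in a few lines of Python. This is a minimal illustration, not a real DBMS mechanism: the class names Replica, SynchronousStore, and AsynchronousStore are hypothetical, and each "site" is just an in-memory dictionary.

```python
from collections import deque


class Replica:
    """One copy of the data held at a site (hypothetical helper)."""

    def __init__(self):
        self.data = {}


class SynchronousStore:
    """Applies every update to all copies before the write returns."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        for replica in self.replicas:  # the caller waits for every site
            replica.data[key] = value


class AsynchronousStore:
    """Applies the update locally and queues it for the other copies."""

    def __init__(self, local, remotes):
        self.local = local
        self.remotes = remotes
        self.pending = deque()

    def write(self, key, value):
        self.local.data[key] = value       # fast local update
        self.pending.append((key, value))  # propagation is delayed

    def propagate(self):
        """Deliver queued updates to the remote copies, oldest first."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.remotes:
                replica.data[key] = value
```

Between write and propagate, the asynchronous copies disagree, which is the tolerated inconsistency; the synchronous write pays for consistency by touching every site on each update.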
Advantages of Distributed Database
1. Increased reliability and availability
Even when a component fails, the database may continue to
function, albeit at a reduced level
2. Local control over data
Local control promotes data integrity and administration
3. Modular growth
Easy to add a connection to a new location
Less chance of disrupting existing users during expansion
4. Lower communication costs
5. Faster response for certain queries
Query local data
Parallel queries
Disadvantages of Distributed Database
Software cost and complexity.
Processing overhead.
Data integrity exposure.
Slower response for certain queries.
If data are not distributed properly, according to
their usage, or if queries are not formulated
correctly, queries can be extremely slow
Options for Distributing a Database
Data Replication
Horizontal Partitioning
Vertical Partitioning
Combinations of the above
Data Replication
Advantages
Reliability: if one node fails, you can find data at
another node
Fast response at sites that have a full copy
May avoid complicated distributed transaction
integrity routines (if replicated data is refreshed at
scheduled intervals)
Decouples nodes: transactions proceed even if
some nodes are down
Reduced network traffic at prime time, if updates
can be delayed to non-prime-time hours
Data Replication
Disadvantages
Storage requirements
Complexity and cost of updating
Integrity exposure: users may retrieve incorrect data if
replicated copies are not updated simultaneously
Data Replication
Best for non-volatile/static, non-collaborative
data
Catalogs
Telephone directories
Train Schedules
Not good for on-line applications
Airline reservations
ATM transactions
Types of Data Replication
Push Replication
Updating site sends changes to other sites
Pull Replication
Receiving sites control when update
messages will be processed
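The difference between the two styles is who initiates the transfer. The sketch below is illustrative only: Source, Target, change_log, and subscribers are hypothetical names, and the "network" is a direct method call.

```python
class Target:
    """A receiving site holding a replicated copy."""

    def __init__(self):
        self.data = {}
        self.position = 0  # number of source log entries already applied

    def apply(self, changes):
        for key, value in changes:
            self.data[key] = value

    def pull(self, source):
        """Pull replication: the receiving site decides when to catch up."""
        self.apply(source.change_log[self.position:])
        self.position = len(source.change_log)


class Source:
    """The updating site."""

    def __init__(self):
        self.data = {}
        self.change_log = []   # ordered (key, value) changes
        self.subscribers = []  # sites that receive pushed changes

    def update(self, key, value):
        self.data[key] = value
        self.change_log.append((key, value))
        # Push replication: the updating site sends the change out itself.
        for site in self.subscribers:
            site.apply([(key, value)])
```

A subscribed Target sees each change as soon as update runs, while an unsubscribed one stays stale until it calls pull.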
Types of Push Replication
Snapshot Replication
Changes periodically sent to master site
Master collects updates in log
Near Real-Time Replication
Broadcast update orders without requiring
confirmation
Update messages stored in message queue until
processed by receiving site
Issues in Data Replication Use
Data timeliness: high tolerance for out-of-date
data may be required
DBMS capabilities: if the DBMS cannot support
multi-node queries, replication may be necessary
Performance implications: refreshing may cause
performance problems for busy nodes
Network heterogeneity: complicates replication
Network communication capabilities: complete
refreshes place heavy demand on
telecommunications
Horizontal Partitioning
Different rows of a table at different sites
Advantages
Data stored close to where it is used: efficiency
Local access optimization: better performance
Only relevant data is available: security
Unions across partitions: ease of query
Disadvantages
Accessing data across partitions: inconsistent
access speed
If data are not replicated: backup vulnerability
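As a small illustration of horizontal partitioning, the sketch below splits the rows of a table by a region column and rebuilds the whole table with a union. The table contents, the region column, and the function names are all made up for the example.

```python
def partition_by_region(rows):
    """Split rows into one partition (site) per region value."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row["region"], []).append(row)
    return partitions


def union_all(partitions):
    """Rebuild the full table by concatenating every partition."""
    rows = []
    for site_rows in partitions.values():
        rows.extend(site_rows)
    return rows


# Hypothetical sample table.
customers = [
    {"id": 1, "name": "Ann", "region": "East"},
    {"id": 2, "name": "Bob", "region": "West"},
    {"id": 3, "name": "Cy", "region": "East"},
]
sites = partition_by_region(customers)
```

Each site holds complete rows, so a query confined to one region never leaves its site, and a global query is a plain union of the partitions.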
Vertical Partitioning
Different columns of a table at different sites
Advantages and disadvantages are the same as
for horizontal partitioning, except that
combining data across partitions is more
difficult because it requires joins (instead of
unions)
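To show why vertical partitions need a join, the sketch below stores different columns of the same rows at two sites, repeating the primary key in each. The employees table, the site names, and the helper functions are hypothetical.

```python
def split_columns(rows, key, columns):
    """Keep the key plus the named columns for one site."""
    return [{key: r[key], **{c: r[c] for c in columns}} for r in rows]


def join_on_key(left, right, key):
    """Recombine two vertical partitions by matching on the shared key."""
    right_by_key = {r[key]: r for r in right}
    return [
        {**l, **right_by_key[l[key]]}
        for l in left
        if l[key] in right_by_key
    ]


# Hypothetical sample table.
employees = [
    {"id": 1, "name": "Ann", "salary": 50000},
    {"id": 2, "name": "Bob", "salary": 60000},
]
hr_site = split_columns(employees, "id", ["name"])
payroll_site = split_columns(employees, "id", ["salary"])
```

Neither site alone holds a full row, so any query that needs both name and salary must match rows across sites on id, which is more work than the union used for horizontal partitions.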
Factors in Choice of Distributed Strategy
No approach to data distribution is ALWAYS best
Choice depends on
Funding, autonomy, security.
Site data referencing patterns.
Growth and expansion needs.
Technological capabilities.
Costs of managing complex technologies.
Need for reliable service.
Distributed DBMS
A distributed database requires a distributed DBMS
Functions of a distributed DBMS:
Locate data with a distributed data dictionary
Determine location from which to retrieve data and process
query components
DBMS translation between nodes with different local DBMSs
(handle heterogeneous DBMS translation using middleware)
Data consistency (via multiphase commit protocols)
Global primary key control
Scalability
Security, concurrency, query optimization, failure recovery
Distributed DBMS Data Reference
Local Transaction: references only data stored at the originating site.
Global Transaction: references data at one or more non-local sites.
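One way to picture this distinction is a lookup against the distributed data dictionary. In the sketch below the dictionary is a plain mapping from table name to site; the table names, site names, and functions are assumptions made for illustration.

```python
# Hypothetical data dictionary: which site stores which table.
DATA_DICTIONARY = {
    "customer": "site_a",
    "invoice": "site_a",
    "inventory": "site_b",
}


def sites_for(tables, dictionary):
    """Return the set of sites that hold the referenced tables."""
    return {dictionary[table] for table in tables}


def classify(origin_site, tables, dictionary):
    """A transaction is local only if every table lives at its origin site."""
    if sites_for(tables, dictionary) <= {origin_site}:
        return "local"
    return "global"
```

For example, a transaction started at site_a that touches customer and invoice is local, while one that also touches inventory is global and would need the commit protocol described later.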
Distributed DBMS Architecture
Distributed DBMS Transparency Objectives
Location Transparency
User/application does not need to know where data resides
Replication Transparency
User/application does not need to know about duplication
Failure Transparency
Either all of the actions of a transaction are committed or else
none of them is committed.
If a transaction fails at one site, it does not commit at the other sites
A system should detect a failure (broken communication link,
erroneous data, disk head crash), reconfigure the system and recover
Each site has a transaction manager
Logs transactions and before and after images
Requires special commit protocol
Failure Transparency: Two-Phase Commit
Commit Protocol: Ensures that a global
transaction is either successfully completed at
each site or else aborted.
Two-Phase Commit
Prepare Phase: Check whether the operation is OK at all
participating sites
Commit Phase: Issue the commit only if all participating
sites agree; otherwise abort at every site
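The two phases can be sketched as a small coordinator loop. This is a simplified illustration that ignores logging, timeouts, and coordinator failure; Participant, can_commit, and two_phase_commit are hypothetical names, not part of any particular DBMS.

```python
class Participant:
    """One site taking part in a global transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # stands in for the site's local checks
        self.state = "active"

    def prepare(self):
        """Phase 1: vote on whether this site is able to commit."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit everywhere only if every site votes yes; else abort everywhere."""
    # Phase 1 (prepare): collect a vote from every participating site.
    votes = [site.prepare() for site in participants]
    # Phase 2 (commit): act on the unanimous decision.
    if all(votes):
        for site in participants:
            site.commit()
        return True
    for site in participants:
        site.abort()
    return False
```

A single "no" vote is enough to abort the transaction at every site, which is what gives the all-or-nothing behavior described under failure transparency.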
Distributed DBMS Transparency Objectives
Concurrency Transparency
Allow multiple users to run transactions
concurrently, with each transaction appearing as if
it were the only activity in the system
Timestamping
Ensure that even if two events occur simultaneously
at different sites, each will have a unique timestamp.
Alternative to locks in distributed databases
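The slides do not say how uniqueness is achieved. One common scheme, assumed here, pairs a local counter with a site identifier, so two sites can never produce the same value even if their counters match. SiteClock and next_timestamp are hypothetical names.

```python
import itertools


class SiteClock:
    """Issues timestamps of the form (local counter, site id)."""

    def __init__(self, site_id):
        self.site_id = site_id
        self._counter = itertools.count(1)

    def next_timestamp(self):
        # The site id breaks ties between sites whose counters are equal.
        return (next(self._counter), self.site_id)
```

Because Python compares tuples element by element, these timestamps also have a total order, which a timestamp-based scheduler could use to decide which of two conflicting transactions is older.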