Advanced Topics on Database - Unit-1 AU17


UNIT-1

Distributed Data Processing Distributed data processing allows multiple computers, located at geographically separate sites, to work together over a network, with each local computer handling its own local processing needs. In distributed processing, a database's logical processing is shared among two or more physically independent sites that are connected through a network.

Distributed processing has been driven by dramatically reduced workstation costs and improved user interfaces and desktop power, and it increases the ability to share data across multiple servers. We define a distributed system as one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.

Advantages:
1. Availability
2. Resource sharing
3. Incremental growth
4. Increased user involvement and control
5. End-user productivity
6. Distance and location independence
7. Privacy and security

Disadvantages:
1. More difficult testing and failure diagnosis
2. More components and dependence on communication links mean more points of failure
3. Incompatibility of components
4. Incompatibility of data
5. More complex management and control
6. Difficulty in controlling corporate information resources
7. Suboptimal procurement
8. Duplication of effort

Distributed System Architecture
Database System Architecture: Database architecture focuses on the design, development, implementation and maintenance of computer programs that store and organize information for businesses, agencies and institutions.
Centralized Architecture: A centralized database system consists of a single processor together with its associated data storage devices and other peripherals. The database is maintained at the central site, and data can be accessed from multiple sites through a computer network.
General-purpose computer system: one to a few CPUs and a number of device controllers connected through a common bus that provides access to shared memory.
Single-user system (e.g., personal computer or workstation): a desktop unit with a single user, usually only one CPU and one or two hard disks; the OS may support only one user.
Multi-user system: more disks, more memory, multiple CPUs and a multi-user OS; it serves a large number of users who are connected to the system via terminals. Such systems are often called server systems.

Advantages: 1. Data integrity is maximized because the whole database is stored at a single physical location, which makes it easier to coordinate the data and keep it as accurate and consistent as possible. 2. Data redundancy is minimal in a centralized database.
Disadvantages: 1. If the server fails, the whole system fails. 2. It takes more time to search for and access data. 3. Bottlenecks can appear when traffic spikes.

Client-Server Architecture The functionality is divided into two classes: servers and clients. A client is defined as a requester of services and a server as a provider of services. Client/server computing is a very efficient and safe means of sharing database information in a multi-user environment. It provides a two-level architecture, which makes it easier to manage the complexity of modern DBMSs and the complexity of distribution.
Client: a user machine that provides user-interface capabilities and local processing.
Server: a system containing both hardware and software; it provides services to the client machines such as file access, printing, archiving, or database access.
Server functions: mainly data management, including query processing, optimization, transaction management, etc.
Client functions: may also include some data-management functions, not just the user interface.

Client-server architecture may be either two-tier or three-tier. In a two-tier architecture, the server performs the database functions and the clients perform the user-interface functions; either the server or the clients may perform the business functions. In a three-tier architecture, the clients perform the user-interface functions, a database server performs the database functions, and separate computers, called application servers, perform the business functions and act as an interface between the clients and the database server. Client-server systems are classified as: 1. Multiple clients - single server 2. Multiple clients - multiple servers

Multiple clients - single server

Multiple Clients - Multiple Servers

Problems with the multiple clients - single server architecture are as follows: 1. Single point of failure, i.e., server failure 2. Bottleneck at the server 3. Difficult to scale the database

2. Multiple Clients - Multiple Servers: Two alternative management strategies are possible: either each client manages its own connection to the appropriate server, or each client knows only its "home server", which then communicates with other servers as required. The system consists of clients running client software, a set of servers that provide all database functionality, and a reliable communication infrastructure. When a server receives a query that requires access to data at other servers, it generates the appropriate sub-queries to be executed by those servers and puts the results together to compute the answer to the original query.
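The decompose-and-merge idea behind the "home server" strategy can be sketched in a few lines of Python. This is only an illustration, not a real DDBMS API: the Server class, run_subquery() and run_query() names are invented for the example, and each "server" is just an in-memory list of rows.

```python
# Toy model of the "home server" strategy: the client talks only to its home
# server, which runs the query locally, forwards sub-queries to its peers and
# merges the partial results.

class Server:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows                      # this server's fragment of the relation

    def run_subquery(self, predicate):
        # Each server evaluates the sub-query against its local fragment only.
        return [row for row in self.rows if predicate(row)]


class HomeServer(Server):
    def __init__(self, name, rows, peers):
        super().__init__(name, rows)
        self.peers = peers                    # the other servers it knows about

    def run_query(self, predicate):
        # Decompose the query, run the sub-queries, and merge the results.
        results = self.run_subquery(predicate)
        for peer in self.peers:
            results.extend(peer.run_subquery(predicate))
        return results


site2 = Server("site2", [{"id": 3, "city": "Pune"}])
site3 = Server("site3", [{"id": 7, "city": "Chennai"}])
home = HomeServer("site1", [{"id": 1, "city": "Chennai"}], [site2, site3])

print(home.run_query(lambda r: r["city"] == "Chennai"))  # rows from site1 and site3
```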

Advantages of client-server architectures: 1. Horizontal and vertical scaling of resources 2. Better price/performance on client machines 3. Ability to use familiar tools on client machines 4. Client access to remote data
Disadvantages of client-server architectures: 1. Maintenance cost is higher. 2. Security problems grow as the number of users and processing sites increases. 3. Complexity increases.

Server System Architecture Server systems can be broadly categorized into two kinds: 1. Transaction servers, which are widely used in relational database systems. 2. Data servers, which are used in object-oriented database systems. Transaction server: A transaction server is a specialized type of server that manages the operations of software-based transactions, i.e., transaction processing. It manages application and database transactions on a network or the Internet, within a distributed computing environment.

A transaction server allows you to break down transactions into components that perform discrete functions. A component notifies the transaction server whether or not it has completed successfully. If all the components in a transaction complete successfully, the transaction is committed; if not, the transaction is rolled back. Transaction servers are also called query server systems or SQL server systems. Requests are specified in SQL and communicated to the server through a Remote Procedure Call (RPC) mechanism. Open Database Connectivity (ODBC) is an application program interface standard from Microsoft for connecting to a server, sending SQL requests, and receiving results.
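As a hedged illustration of the request/commit/rollback pattern described above, the sketch below uses the pyodbc package to send SQL to a server over ODBC. The DSN name "SalesDSN", the credentials and the accounts table are assumptions made up for the example; they would have to exist in a real setup.

```python
# Assumed setup: pyodbc installed, an ODBC DSN named "SalesDSN" configured,
# and an accounts(id, balance) table that already exists on the server.

import pyodbc

conn = pyodbc.connect("DSN=SalesDSN;UID=appuser;PWD=secret", autocommit=False)
cur = conn.cursor()

try:
    # Two discrete components of one logical transaction.
    cur.execute("UPDATE accounts SET balance = balance - 50 WHERE id = ?", 1)
    cur.execute("UPDATE accounts SET balance = balance + 50 WHERE id = ?", 2)
    conn.commit()      # every component reported success -> commit
except pyodbc.Error:
    conn.rollback()    # any failure -> roll the whole transaction back
    raise
finally:
    conn.close()
```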

Data server: Data servers are used in LANs where there is a very high-speed connection between the clients and the server. The main issues are page shipping versus item shipping, locking, data caching and lock caching.

Parallel System A system is said to be a parallel system when multiple processors have direct access to shared memory, which forms a common address space. Two main performance measures are: a. Throughput: the number of tasks that can be completed in a given time interval. b. Response time: the amount of time it takes to complete a single task from the time it is submitted.

Speedup and Scaleup: Speedup: a fixed-size problem executing on a small system is given to a system that is N times larger; speedup measures how much faster the same problem runs on the larger system. Scaleup: increase the size of both the problem and the system; the N-times-larger system is used to perform an N-times-larger job, and scaleup measures how well the larger system keeps up.
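One compact way to state these measures, using the usual textbook notation (T_S for the elapsed time on the small system, T_L on the N-times-larger system, Q for the original task and Q_N for a task N times larger):

```latex
\[
  \text{speedup} = \frac{T_S(Q)}{T_L(Q)}, \qquad
  \text{scaleup} = \frac{T_S(Q)}{T_L(Q_N)}
\]
% Speedup is linear when it equals N; scaleup is linear when it stays at 1.
```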

Factors Limiting Speedup and Scaleup: 1. Startup costs: the cost of starting up multiple processes may dominate the computation time if the degree of parallelism is high. 2. Interference: processes accessing shared resources (e.g., the system bus, disks, or locks) compete with each other and thus spend time waiting on other processes rather than performing useful work. 3. Skew: increasing the degree of parallelism increases the variance in the service times of tasks executing in parallel, and the overall execution time is determined by the slowest of those tasks.

Distributed System A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages. Usually, distributed systems are asynchronous, i.e., they do not use a common clock and do not impose any bounds on relative processor speeds or message transfer times. Local and global transactions are differentiated as follows: a local transaction accesses data only at the site at which the transaction was initiated; a global transaction either accesses data at a site different from the one at which it was initiated or accesses data at several different sites.

Trade-offs in distributed systems: a. Sharing data: users at one site are able to access data residing at other sites. b. Autonomy: each site is able to retain a degree of control over the data stored locally. c. Higher system availability through redundancy: data can be replicated at remote sites, and the system can keep functioning even if a site fails.

Parallel Database Parallel database systems consist of multiple processors and multiple disks connected by a fast interconnection network. A parallel database improves the performance of data processing by using multiple resources, such as multiple CPUs and disks, simultaneously.

Goals of Parallel Databases: Improve performance: the performance of the system can be improved by connecting multiple CPUs and disks in parallel. Improve availability of data: data can be copied to multiple locations to improve its availability. Improve reliability: the reliability of the system is improved through completeness, accuracy and availability of data. Provide distributed access to data: companies with branches in multiple cities can access data with the help of a parallel database system.

Advantages of Parallel Databases: 1. Performance improvement 2. High availability: the same data is stored at multiple locations 3. Increased reliability: even if one data site fails, execution can continue because other copies of the data are available.

I/O Parallelism I/O parallelism refers to reducing the time required to retrieve relations from disk by partitioning the relations across multiple disks. The most common form of data partitioning in a parallel database environment is horizontal partitioning, in which the tuples of a relation are divided among many disks so that each tuple resides on one disk. Several partitioning strategies have been proposed; the three basic data-partitioning strategies are round-robin, hash partitioning and range partitioning.

Partitioning techniques are as follows:
1. Round-robin: scan the relation in any order and send the i-th tuple to disk number D(i mod n). This ensures an even distribution of tuples across the disks, so each disk has approximately the same number of tuples as the others.
2. Hash partitioning: choose one or more attributes as the partitioning attributes and a hash function whose range is {0, 1, ..., n-1}. Each tuple of the original relation is hashed on the partitioning attributes; if the hash function returns i, the tuple is placed on disk Di. Hash partitioning is also useful for sequential scans of the entire relation.
3. Range partitioning: choose an attribute as the partitioning attribute and a partitioning vector [v0, v1, ..., vn-2]. Let v be the partitioning-attribute value of a tuple. Tuples with v < v0 go to disk 0, tuples with vi <= v < vi+1 go to disk i+1, and tuples with v >= vn-2 go to disk n-1.
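The three strategies can be illustrated with a short Python sketch. The "disks" are plain in-memory lists, and the relation, attribute names and partitioning vector are invented for the example; a real system would of course write to physical disks.

```python
# Toy illustration of round-robin, hash and range partitioning over n "disks".

def round_robin(relation, n):
    disks = [[] for _ in range(n)]
    for i, t in enumerate(relation):
        disks[i % n].append(t)              # i-th tuple goes to disk i mod n
    return disks

def hash_partition(relation, n, attr):
    disks = [[] for _ in range(n)]
    for t in relation:
        disks[hash(t[attr]) % n].append(t)  # hash of the partitioning attribute picks the disk
    return disks

def range_partition(relation, vector, attr):
    # vector = [v0, v1, ..., v(n-2)] splits the attribute domain into n ranges.
    disks = [[] for _ in range(len(vector) + 1)]
    for t in relation:
        v = t[attr]
        for i, bound in enumerate(vector):
            if v < bound:
                disks[i].append(t)          # v < v0 -> disk 0, vi <= v < vi+1 -> disk i+1
                break
        else:
            disks[-1].append(t)             # v >= v(n-2) -> disk n-1
    return disks

rows = [{"id": i, "marks": m} for i, m in enumerate([35, 62, 78, 91, 48, 55])]
print(round_robin(rows, 3))
print(hash_partition(rows, 3, "id"))
print(range_partition(rows, [50, 75], "marks"))
```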

Handling of Skew Data skew refers to a non-uniform distribution of data in a dataset. The direct impact of data skew on the parallel execution of complex database queries is poor load balancing, leading to high response times. Skew is of two types: attribute-value skew and partition skew. 1. Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition, resulting in skew. 2. Partition skew: there may be a load imbalance across the partitions even when there is no attribute-value skew.

Interquery Parallelism In interquery parallelism, different queries or transactions execute in parallel with one another. The response times of individual transactions are no faster than they would be if the transactions were run in isolation; thus, the primary use of interquery parallelism is to scale up a transaction-processing system to support a larger number of transactions per second.

Intraquery Parallelism Intraquery parallelism refers to the execution of a single query in parallel on multiple processors and disks. Using intraquery parallelism is essential for speeding up long-running queries. There are two complementary forms of intraquery parallelism: 1. Intraoperation parallelism: parallelize the execution of each individual operation in the query. 2. Interoperation parallelism: execute the different operations in a query expression in parallel.

Distributed Database Concepts A distributed database is a logically interrelated collection of shared data physically distributed over a computer network. A DDBMS must include the following components: 1. Computer workstations 2. Network hardware and software 3. Communications media 4. Transaction processor: the software component, found in each computer, that receives and processes the application's data requests 5. Data processor (data manager): the software component, residing on each computer, that stores and retrieves the data located at that site; it may be a centralized DBMS.

The distributed database system maintains local and global schemas for all sites, and data storage is the responsibility of the data sites. Data is stored using the following methods: 1. Replication 2. Fragmentation 3. Hybrid 4. Allocation. The allocation of fragments depends on how the database will be used, so the database schema is designed first and the application programs are designed afterwards.

The information required from the applications is as follows: 1. The frequency with which each transaction is executed 2. The site at which each transaction is executed 3. The conditions under which each transaction is performed

There are five big reasons for using a distributed database system: 1. Many organizations are distributed in nature. 2. Multiple databases can be accessed transparently. 3. The database can be expanded incrementally: as needs arise, additional computers can be connected to the distributed database system. 4. Reliability and availability are increased: a distributed database can replicate data among several sites, so even if one site fails, the redundancy in the data increases the availability and reliability of the data as a whole. 5. Performance increases: query processing can be performed at multiple sites, so distributed database systems can mimic parallel database systems on a high-performance network.

Homogeneous and Heterogeneous Databases
Homogeneous DDBMS: a) All sites use the same DBMS product. b) Such systems are much easier to design and manage. c) The approach provides incremental growth and allows increased performance. d) Homogeneous databases can communicate directly with each other.

Heterogeneous DDBMS: a) Sites may run different DBMS products, possibly with different underlying data models. b) This situation occurs when sites have implemented their own databases first and integration is considered later. c) Translations are required to allow for different hardware and/or different DBMS products. d) The typical solution is to use gateways, so heterogeneous databases communicate through gateway interfaces. e) No DDBMS currently provides full support for fully heterogeneous distributed databases.

Distributed Data Storage Consider a relation r that is to be stored in the database. Two methods are used for storing relations in a distributed database: a) Replication: the system maintains several identical replicas of the relation and stores each replica at a different site; the alternative to replication is to store only one copy of relation r. b) Fragmentation: the system partitions the relation into several fragments and stores each fragment at a different site.

Data Replication If relation r is replicated, a copy of relation r is stored in two or more sites. Using data replication, each logical data item of a database has several physical copies, each of them located on a different machine, also referred to as site or node. In full replication the entire database is replicated and in partial replication some selected part is replicated to some of the sites.

Data Fragmentation Fragmentation is a design technique that divides a single relation or class of a database into two or more partitions such that the combination of the partitions reproduces the original database without any loss of information. The main reasons for fragmenting relations are to 1. increase the locality of reference of the queries submitted to the database, 2. improve the reliability and availability of data and the performance of the system, and 3. balance storage capacities and minimize communication costs among sites.

Horizontal Fragmentation The horizontal fragmentation of a relation R is the subdivision of its tuples into subsets called fragments. Each fragment Ri of R contains a subset of the rows, and each tuple of R is assigned to one or more fragments. Horizontal fragmentation is lossless and is defined as a selection operation.
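A minimal Python sketch of horizontal fragmentation as selection follows; the employees relation, the branch attribute and the predicates are invented for illustration, with rows modelled as dictionaries.

```python
# Sketch of horizontal fragmentation as a selection: each fragment keeps every
# column but only the rows that satisfy its predicate.

employees = [
    {"eno": 1, "name": "Asha",  "branch": "Chennai"},
    {"eno": 2, "name": "Ravi",  "branch": "Madurai"},
    {"eno": 3, "name": "Meena", "branch": "Chennai"},
]

def horizontal_fragment(relation, predicate):
    return [row for row in relation if predicate(row)]

frag_chennai = horizontal_fragment(employees, lambda r: r["branch"] == "Chennai")
frag_madurai = horizontal_fragment(employees, lambda r: r["branch"] == "Madurai")

# Lossless: the union of the fragments rebuilds the original relation.
assert sorted(frag_chennai + frag_madurai, key=lambda r: r["eno"]) == employees
```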

There are two versions of horizontal fragmentation: a. Primary horizontal fragmentation of a relation is achieved by using predicates defined on that relation to restrict its tuples. b. Derived horizontal fragmentation is realized by using predicates that are defined on other relations.

Vertical Fragmentation Some of the columns of a relation are projected into a base relation at one of the sites, and other columns are projected into a base relation at another site.
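In the same spirit, here is a minimal sketch of vertical fragmentation as projection, again with an invented relation; the key attribute (eno) is repeated in every fragment so that the original relation can be rebuilt by joining the fragments.

```python
# Sketch of vertical fragmentation as a projection: each fragment keeps a
# subset of the columns, with the key (eno) repeated in every fragment.

employees = [
    {"eno": 1, "name": "Asha", "salary": 50000},
    {"eno": 2, "name": "Ravi", "salary": 62000},
]

def vertical_fragment(relation, columns):
    return [{c: row[c] for c in columns} for row in relation]

frag_site1 = vertical_fragment(employees, ["eno", "name"])     # stored at site 1
frag_site2 = vertical_fragment(employees, ["eno", "salary"])   # stored at site 2

# Reconstruction: join the two fragments on the common key.
rebuilt = [{**a, **b} for a in frag_site1 for b in frag_site2 if a["eno"] == b["eno"]]
assert rebuilt == employees
```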

Two types of heuristics for vertical fragmentation exist: 1. Grouping: assign each attribute to one fragment and, at each step, join some of the fragments until some criterion is satisfied; this is a bottom-up approach. 2. Splitting: start with the whole relation and decide on a beneficial partitioning based on the access behaviour of applications to the attributes; this is a top-down approach.

Distribution Transparency Distribution transparency allows a distributed database to be treated as a single logical database. The following levels of distribution transparency are recognized: 1. Fragmentation transparency 2. Location transparency 3. Replication transparency

Fragmentation transparency is the highest level of transparency: the end user or programmer does not need to know that the database is partitioned. Location transparency exists when the end user or programmer must specify the database fragment names but does not need to specify where those fragments are located.

Replication Transparency The fact that there may be more than one copy of a table stored in the system should be hidden from the user. Replication transparency enables the user to query any table as if there were only one copy of it.

Distributed Transactions A distributed transaction is composed of several sub-transactions, each running at a different site, and each site's database manager can independently decide to abort its sub-transaction.

When a transaction is submitted, the transaction manager at that site breaks it up into one or more sub-transactions that execute at different sites, submits them to the transaction managers at those sites, and coordinates their activity. Why are distributed transactions hard? 1. Atomicity: different parts of a transaction may be at different sites; how do we ensure that all of them commit or none of them do? 2. Consistency: a failure may affect only part of the transaction. 3. Isolation: commitment must occur "simultaneously" at all sites. 4. Durability: not much different once the other problems are solved, but it also makes "delayed commit" difficult.

For each such transaction, the coordinator is responsible for: 1. Starting the execution of the transaction. 2. Breaking the transaction into a number of sub-transactions and distributing these sub-transactions to the appropriate sites for execution. 3. Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites.

System Failure Modes The basic failure types are: 1. Failure of a site 2. Loss of messages 3. Failure of a communication link 4. Network partition

Scenario (the two-army problem): Blue army 1 sends to Blue army 2 the message "let's attack tomorrow at dawn". Later, Blue 2 sends a confirmation to Blue 1: "splendid idea, see you at dawn". But Blue 1 realizes that Blue 2 does not know whether the confirmation arrived, so Blue 1 sends to Blue 2 "message arrived, battle set"; then Blue 2 realizes that Blue 1 does not know whether that message arrived, and so on. Because of the unreliable communication, the two blue armies can never be sure; no certain agreement can be reached by this method.

Transaction: a sequence of actions treated as an atomic action to preserve consistency (e.g., access to a database). Commit a transaction: an unconditional guarantee that the transaction will complete successfully (even in the presence of failures). Abort a transaction: an unconditional guarantee to back out of a transaction, i.e., that all the effects of the transaction have been removed. Events that may cause a transaction to abort include deadlocks, timeouts and protection violations. Mechanisms that facilitate backing out of an aborting transaction are the write-ahead-log protocol and shadow pages. A commit protocol ensures that all the sites either commit or abort a transaction unanimously, even in the presence of multiple and repeated failures.

Two-Phase Commit Protocol The two-phase commit protocol is a distributed algorithm that lets all the sites in a distributed system agree on whether to commit a transaction. The protocol results in either all nodes committing the transaction or all nodes aborting it, even in the case of site failures and message losses.

The algorithm is broken into two phases: 1. The COMMIT-REQUEST phase, in which the COORDINATOR attempts to prepare all the COHORTS, and 2. The COMMIT phase, in which the COORDINATOR completes the transaction at all COHORTS.

Basic algorithm: During phase 1, the coordinator sends a query-to-commit message to all cohorts and then waits for every cohort to report back with an agreement message. If the transaction succeeded at a cohort, that cohort writes an entry to the undo log and an entry to the redo log and replies with an agree message; if the transaction failed at that cohort, it replies with an abort message.

During phase 2, if the coordinator receives an agree message from all cohorts, it writes a commit record into its log and sends a commit message to all the cohorts; if not all agreement messages come back, the coordinator sends an abort message. Next, the coordinator waits for acknowledgements from the cohorts; when acks have been received from all cohorts, the coordinator writes a complete record to its log. Note that the coordinator will wait indefinitely for all the acknowledgements to come back. If a cohort receives a commit message, it releases all the locks and resources held during the transaction and sends an acknowledgement to the coordinator. If the message is abort, the cohort undoes the transaction using the undo log, releases the resources and locks held during the transaction, and then sends an acknowledgement.
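A minimal sketch of the decision logic described above follows, with the coordinator and cohorts modelled as ordinary Python objects; real message passing, the undo/redo logging, timeouts and failure handling are all abstracted away, and the class and method names are invented for the example.

```python
# Toy two-phase commit: phase 1 collects votes, phase 2 applies the global decision.

class Cohort:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):
        # Phase 1: a real cohort would write undo and redo log entries here.
        return "AGREE" if self.can_commit else "ABORT"

    def commit(self):
        print(f"{self.name}: commit, release locks and resources, send ack")

    def abort(self):
        print(f"{self.name}: undo via log, release locks and resources, send ack")


def two_phase_commit(cohorts):
    # Phase 1 (commit-request): the coordinator collects a vote from every cohort.
    votes = [c.prepare() for c in cohorts]
    # Phase 2 (commit): commit only if every cohort agreed, otherwise abort.
    if all(v == "AGREE" for v in votes):
        for c in cohorts:
            c.commit()
        return "COMMITTED"
    for c in cohorts:
        c.abort()
    return "ABORTED"


print(two_phase_commit([Cohort("site1", True), Cohort("site2", True)]))   # COMMITTED
print(two_phase_commit([Cohort("site1", True), Cohort("site2", False)]))  # ABORTED
```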

Three-Phase Commit Protocol The three-phase commit (3PC) protocol is non-blocking for site failures, except in the event of the failure of all sites. Communication failures can still result in different sites reaching different decisions, thereby violating the atomicity of global transactions. 3PC introduces a third phase, called pre-commit, between voting and the global decision. On receiving all votes from the participants, the coordinator sends a global pre-commit message. A participant that receives the global pre-commit knows that all other participants have voted commit and that, in time, the participant itself will definitely commit.

Basic 3PC protocol:
Phase 1: The coordinator sends VOTE_REQ to all participants. When a participant receives VOTE_REQ, it responds with YES or NO, depending on its vote. If a participant votes NO, it decides abort and stops.
Phase 2: The coordinator collects all votes. If any vote was NO, the coordinator decides abort, sends ABORT to all participants that voted YES, and stops. Otherwise, the coordinator sends PRE_COMMIT messages to all participants. A participant that voted YES waits for a PRE_COMMIT or ABORT message from the coordinator; if it receives a PRE_COMMIT, it responds with an ACK message.
Phase 3: The coordinator collects the ACKs. When they have all been received, it decides commit, sends COMMIT to all participants, and stops. A participant waits for a COMMIT from the coordinator; when it receives that message, it decides commit and stops.
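The three rounds can be sketched from the coordinator's point of view as follows. This is again an illustration with invented names, and it leaves out the message passing, timeouts and the termination protocol that actually make 3PC non-blocking.

```python
# Toy three-phase commit: vote, pre-commit with acks, then the global decision.

class Participant:
    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes

    def vote(self):
        return "YES" if self.vote_yes else "NO"

    def pre_commit(self):
        # Receiving PRE_COMMIT tells the participant that everyone voted YES.
        return "ACK"

    def commit(self):
        print(f"{self.name}: commit")

    def abort(self):
        print(f"{self.name}: abort")


def three_phase_commit(participants):
    # Phase 1: send VOTE_REQ and collect the votes.
    if any(p.vote() == "NO" for p in participants):
        for p in participants:
            p.abort()
        return "ABORTED"
    # Phase 2: every vote was YES, so send PRE_COMMIT and collect the ACKs.
    acks = [p.pre_commit() for p in participants]
    # Phase 3: once every ACK has arrived, send the global COMMIT.
    if all(a == "ACK" for a in acks):
        for p in participants:
            p.commit()
        return "COMMITTED"
    return "UNDECIDED"   # a real 3PC would run its termination protocol here


print(three_phase_commit([Participant("site1"), Participant("site2")]))
```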

Concurrency Control and its Problems In distributed database systems, the database is typically used by many users, and these systems usually allow multiple transactions to run concurrently, i.e., at the same time. Concurrency control is the activity of coordinating concurrent accesses to a database in a multi-user database management system (DBMS).

Main Objectives of Distributed Concurrency Control 1. It must allow recovery from site and communication failures. 2. It must support parallel execution of transactions. 3. The storage mechanism and computational method should be modest, to minimize overhead. 4. Communication delay should be low. 5. It should place few constraints on the structure of the atomic actions of transactions.

Concurrency Control Anomalies Concurrency control coordinates simultaneous transaction execution in a multiprocessing database system. A lack of concurrency control can create data integrity and consistency problems. The main anomalies are:

1. Lost updates 2. Uncommitted data 3. Inconsistent retrievals

Lost Update Problem (W-W Conflict) This problem occurs when two different database transactions perform read/write operations on the same database item in an interleaved manner (i.e., concurrent execution) in a way that makes the item's value incorrect, leaving the database inconsistent. Consider the following schedule, in which two transactions TX and TY are performed on the same account A, whose balance is $300.

At time t1, transaction TX reads the value of account A, i.e., $300 (read only).
At time t2, TX deducts $50 from A in its local workspace, giving $250 (not yet written back).
At time t3, TY reads the value of account A, which is still $300 because TX has not written its update yet.
At time t4, TY adds $100 to A in its local workspace, giving $400 (not yet written back).
At time t6, TX writes its value of A, so the stored balance becomes $250.
At time t7, TY writes its value of A, computed at t4, so the stored balance becomes $400. The value written by TX ($250) is lost, the data becomes incorrect, and the database is left inconsistent.
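The interleaving can be replayed step by step in a few lines of Python; the variable names are invented, and the "transactions" are just sequential statements that mimic the schedule above.

```python
# Replay of the lost-update schedule. Each transaction works on its own private
# copy of A and writes it back blindly, so TX's write is overwritten by TY.

account_a = 300

tx_a = account_a      # t1: TX reads A = 300
tx_a -= 50            # t2: TX deducts 50 locally (not yet written back)
ty_a = account_a      # t3: TY reads A = 300, because TX has not written yet
ty_a += 100           # t4: TY adds 100 locally (not yet written back)
account_a = tx_a      # t6: TX writes 250
account_a = ty_a      # t7: TY writes 400, and TX's update (250) is lost

print(account_a)      # 400 instead of the correct 350
```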

Unrepeatable Read Problem (W-R Conflict) This is also known as the inconsistent retrievals problem; it occurs when, within a single transaction, two different values are read for the same database item. For example, consider two transactions TX and TY performing read/write operations on account A, which has an available balance of $300:

At time t1, transaction TX reads the value of account A, i.e., $300.
At time t2, TY reads the value of account A, i.e., $300.
At time t3, TY updates the value of account A by adding $100 to the available balance, making it $400.
At time t4, TY writes the updated value, $400.
At time t5, TX reads the value of account A again and now sees $400.
Within the same transaction, TX has read two different values of account A ($300 initially and $400 after TY's update), so the read is not repeatable; hence this is known as the unrepeatable read problem.