Cost & selectivity estimation: ★Page-oriented Nested Loop join: I/O Cost = #pages of R + #pages of R * #pages of S, with R as the outer relation. Choose the order of R, S so that #pages of R < #pages of S.
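A quick worked check (numbers are illustrative): with 100 pages in R and 1,000 pages in S, R as the outer gives 100 + 100 * 1,000 = 100,100 I/Os, while S as the outer gives 1,000 + 1,000 * 100 = 101,000 I/Os, so the smaller relation should go outside.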
★Selectivity estimates: Crude estimation assuming a uniform distribution: ① Selectivity = 1 / #distinctValues; with no information, use 1/10. ② Estimated #results = #records / #distinctValues. ③ Range queries: selectivity = range size / domain size (i.e., #distinctValues). ④ These statistics are free if there is an index!
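A minimal Python sketch of these crude rules (function and parameter names are illustrative):

def eq_selectivity(num_distinct=None):
    # Rule 1: 1 / #distinctValues; fall back to 1/10 with no statistics.
    return 1.0 / num_distinct if num_distinct else 1.0 / 10

def estimated_results(num_records, num_distinct):
    # Rule 2: #records / #distinctValues.
    return num_records / num_distinct

def range_selectivity(range_size, domain_size):
    # Rule 3: range size / domain size (i.e., #distinctValues).
    return range_size / domain_size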
★Join cardinality estimates: a cardinality estimate predicts how many rows will result from the join.
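One common textbook formula for an equi-join on attribute A, assuming uniform distributions (added here for reference; the notes do not spell it out):

def join_cardinality(r_rows, s_rows, r_distinct_a, s_distinct_a):
    # |R join S| ~= |R| * |S| / max(V(R, A), V(S, A))
    return r_rows * s_rows / max(r_distinct_a, s_distinct_a)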
Transaction & Distributed Transactions
★Transactions are the basic unit of
change in the DBMS. No partial txns are
allowed.
Transaction Schedules: • Serial schedule: a schedule that does not interleave the actions of different transactions. • Serializable schedule: a schedule that is equivalent to some serial execution of the transactions. • If each transaction preserves consistency, every serializable schedule preserves consistency.
★Conflict Serializable Schedules: Conflict equivalent: ① They involve the same actions of the same transactions. ② Every pair of conflicting actions is ordered the same way. • Schedule S is conflict serializable if S is conflict equivalent to some serial schedule.
★Precedence Graph: One node per txn; edge from Ti to Tj if Tj reads/writes an object last read/written by Ti (each conflicting pair points from the earlier action to the later one). A schedule is conflict serializable iff its dependency graph is acyclic. Conflict serializable implies serializable, but serializable does not imply conflict serializable, e.g., the serial orders T1T2T3 and T2T3T1 are equivalent when everything is ultimately overwritten by T3. No cycle => conflict serializable.
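A minimal Python sketch of the test: build the precedence graph from conflicting pairs and check for a cycle (the schedule encoding is illustrative):

from itertools import combinations

def conflict_serializable(schedule):
    # schedule: list of (txn, op, obj) with op in {"R", "W"}.
    edges = set()
    for a, b in combinations(range(len(schedule)), 2):
        t1, op1, o1 = schedule[a]
        t2, op2, o2 = schedule[b]
        # Conflict: same object, different txns, at least one write.
        if o1 == o2 and t1 != t2 and "W" in (op1, op2):
            edges.add((t1, t2))   # earlier txn -> later txn
    txns = {t for t, _, _ in schedule}
    graph = {t: [v for (u, v) in edges if u == t] for t in txns}
    # Recursive DFS cycle detection with three colors.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in txns}
    def has_cycle(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False
    return not any(color[t] == WHITE and has_cycle(t) for t in txns)

# Example: R1(A) W2(A) W1(A) gives edges T1->T2 and T2->T1, a cycle:
# conflict_serializable([("T1","R","A"), ("T2","W","A"), ("T1","W","A")]) -> False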
Two-phase locking (2PL) concurrency control: A transaction is divided into two
phases: the growing phase, where it acquires all the locks it needs and does not release any, and the
shrinking phase, where it releases locks and cannot acquire any new ones. This rule prevents a
transaction from acquiring new locks after it has started releasing locks. • 2PL allows only schedules
whose precedence graph is acyclic => serializable.
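A minimal sketch of the two-phase rule itself, not a full lock manager (names are illustrative):

class TwoPhaseTxn:
    def __init__(self):
        self.locks = set()
        self.shrinking = False   # flips on the first release

    def acquire(self, obj):
        # Growing phase only: no acquires once any lock has been released.
        assert not self.shrinking, "2PL violation: acquire after release"
        self.locks.add(obj)

    def release(self, obj):
        self.shrinking = True    # txn enters the shrinking phase
        self.locks.discard(obj)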
Strict two-phase locking (2PL) concurrency control: All locks held by a transaction are retained until the
transaction commits or aborts. The main difference from basic 2PL is that in basic 2PL, locks can be
released during the shrinking phase before the transaction completes, while Strict 2PL holds all locks
until the very end. This approach ensures that other transactions cannot read or write the data affected
by the uncommitted transaction. It simplifies transaction aborts.
★2PL drawbacks: – Lock management overhead. – Deadlock detection/resolution. – Lock contention for heavily used objects.
Deadlocks: ★Detection, Waits-for graph: – Nodes are transactions. – Edge from Ti to Tj if Ti is waiting for Tj to release a lock (the waiter points to the holder).
★Prevention: • Assign priorities based on timestamps: earlier timestamp = higher priority. • Assume Ti wants a lock that Tj holds. Two policies: ① Wait-Die: If Ti has higher priority, Ti waits for Tj; otherwise Ti aborts (the higher-priority txn waits, the lower-priority one kills itself). ② Wound-Wait: If Ti has higher priority, Tj aborts; otherwise Ti waits (the higher-priority txn kills the other, the lower-priority one waits).
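A minimal Python sketch of the two policies, where a smaller timestamp means higher priority (names are illustrative); ti, tj are the timestamps of the requester Ti and the holder Tj:

def wait_die(ti, tj):
    # Higher-priority (older) requester waits; lower-priority one "dies".
    return "Ti waits" if ti < tj else "Ti aborts"

def wound_wait(ti, tj):
    # Higher-priority (older) requester "wounds" (aborts) the holder;
    # a lower-priority requester waits.
    return "Tj aborts" if ti < tj else "Ti waits"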
Fixed vs dynamic dataset: ★Fixed database: – Set of tuples is fixed – Can update, but no inserts/deletes
– Can lock all related tuples ★Dynamic databases: – Can insert/delete tuples – Cannot lock all related
tuples • If insertions/deletions are allowed, then even Strict 2PL cannot assure serializability • Conflict
serializability guarantees serializability only if the set of objects is fixed!
★Predicate locking: aims to solve the serializability problems of dynamic databases by locking not just specific records, but all records that satisfy a certain predicate, both existing ones and any that might be added in the future. Drawback: expensive.
★Index locking: Locking database indices that correspond to the predicate. This approach can be
effective but may lock out more records than necessary, especially if the index covers a broad range of
values. More efficient than predicate locking.
Pessimistic vs Optimistic: • Two-phase locking (2PL) – Pessimistic approach – Assume txns will conflict
– Acquire locks on all items before accessing them!
• Timestamp ordering (T/O) – Optimistic approach – Assume that conflicts are rare – Do not acquire
locks, check for conflicts at commit
Optimistic control: The Kung-Robinson Model: Txns have three phases: ① READ: txns read from the database, but make changes to private copies of objects. ② VALIDATE: check for conflicts. ③ WRITE: make local copies of changes public.
★Validation: Each transaction is given a unique identifier, typically a timestamp, assigned at the end of its read phase, just before validation begins. Test conditions that are sufficient to ensure that no conflict occurred: ① For all i and j such that Ti < Tj, check that Ti completes before Tj begins. ② For all i and j such that Ti < Tj, check that: • Ti completes before Tj begins its Write phase • WriteSet(Ti) ∩ ReadSet(Tj) is empty. ③ For all i and j such that Ti < Tj, check that: • Ti completes its read phase before Tj does • WriteSet(Ti) ∩ ReadSet(Tj) is empty • WriteSet(Ti) ∩ WriteSet(Tj) is empty. Tests 2 and 3 check: does Tj read dirty data? Does Ti overwrite Tj's writes? If Test 1 passes, validation succeeds and no further checks are needed. If Test 1 fails, proceed to Test 2: if WriteSet(Ti, the earlier txn) ∩ ReadSet(Tj, the later txn) is empty, it is valid; otherwise it is not valid and Tj is restarted.
Assignment of txn id, validation, and the Write phase are inside a critical section! – Nothing else goes on
concurrently. – So, no need to check for Test 3 --- can’t happen.
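A minimal Python sketch of the three tests for one pair Ti < Tj; the dict fields (start, finish, write_start, read_finish, reads, writes) are illustrative, not from the notes:

def validate(ti, tj):
    # Test 1: Ti finishes entirely before Tj begins.
    if ti["finish"] < tj["start"]:
        return True
    # Test 2: Ti finishes before Tj starts its Write phase, and Ti's
    # writes don't intersect Tj's reads (Tj never reads dirty data).
    if ti["finish"] < tj["write_start"] and not (ti["writes"] & tj["reads"]):
        return True
    # Test 3: Ti finishes its Read phase before Tj does, and Ti's writes
    # intersect neither Tj's reads nor Tj's writes.
    if (ti["read_finish"] < tj["read_finish"]
            and not (ti["writes"] & tj["reads"])
            and not (ti["writes"] & tj["writes"])):
        return True
    return False   # conflict possible: restart Tj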
★Overheads in Optimistic CC: ① Record read/write activity in ReadSet and WriteSet per txn. – Must
create and destroy these sets as needed. ② Check for conflicts during validation, and make validated
writes “global”. – Critical section can reduce concurrency.
③ Optimistic CC restarts txns that fail validation. – Work done so far is wasted. –Requires clean-up.
Parallel architectures: Shared Memory is easy to use and program, but very expensive to scale. Shared Nothing is the most scalable architecture (lowest contention) but the most difficult to administer and tune.
Two-phase commit (2PC) protocol: an ACID-compliant protocol.
1. The coordinator sends a prepare message to each subordinate.
2. When a subordinate receives a prepare message, it force-writes an abort or prepare log record and then sends a no or yes message to the coordinator. Notice that a prepare log record is not used in a centralized DBMS; it is unique to the distributed commit protocol.
3. If the coordinator receives yes messages from all subordinates, it force-writes a commit log record and then sends a commit message to all subordinates. If it receives even one no message, or does not receive any response from some subordinate within a specified time-out interval, it force-writes an abort log record and then sends an abort message to all subordinates (this depends on the problem statement: whether nodes that voted no are notified of the final decision; if not, no abort needs to be sent to those subordinates).
4. When a subordinate receives an abort message, it force-writes an abort log record, need not send an ack message to the coordinator, and aborts the subtransaction. When a subordinate receives a commit message, it force-writes a commit log record, sends an ack message to the coordinator, and commits the subtransaction.
5. After the coordinator has received ack messages from all subordinates, it writes an end log record for
the transaction.
Steps 1-2 are the voting phase; steps 3-5 are the termination/completion phase; both are initiated by the coordinator. The basic principle is that any of the transaction managers involved (coordinator or subordinates) can unilaterally abort a transaction, whereas there must be unanimity to commit a transaction. A sender always logs its decision to stable storage before sending the message. A transaction is officially committed at the time the coordinator's commit log record reaches stable storage. Subsequent failures cannot affect the outcome of the transaction.
★Presumed Commit (left) & Presumed Abort (right): (comparison figure omitted). Under presumed abort, a site that finds no information about a transaction presumes it aborted, so abort messages need no acks.
Restart After a Failure at a Site:
① If we have a commit/abort log rec for Xact T, but not an end rec, must redo/undo T. – If coordinator, keep sending commit/abort msgs to subs until acks received. If subordinate, simply redo/undo. ② If we have a prepare log rec for Xact T, but not commit/abort, this site is a subordinate for T. – Repeatedly contact the coordinator to find the status of T, then write a commit/abort log rec; redo/undo T; and write an end log rec. ③ If we don't have even a prepare log rec for T: if subordinate, unilaterally abort and undo T; if coordinator, subordinates may send messages later.
★Coordinator failures & Blocking: ① If coordinator fails after sending prepare msgs but before writing
commit/abort log recs, when it comes back up it aborts the Xact. ② If coordinator for Xact T fails,
subordinates who have voted yes cannot decide whether to commit or abort T until coordinator recovers.
– T is blocked. – Even if all subordinates know each other, they are blocked unless one of them voted
no.
★Link and Remote Site Failures: If a remote site does not respond for Xact T, either because the site
failed or the link failed: ① If the current site is the coordinator for T, should abort T. ② If the current site
is a subordinate, and has not yet voted yes, it should abort T. ③ If the current site is a subordinate and
has voted yes, it is blocked until the coordinator responds.
★Refinements (read-only optimization): 1. If a subtransaction has not changed the database (which can be easily detected by
keeping a count of update log records), the subordinate can respond to a prepare message from the
coordinator with a reader message, instead of yes or no. The subordinate writes no log records in this
case.
2. When a coordinator receives a reader message, it treats the message as a yes vote, but with the
optimization that it does not send any more messages to the subordinate, because the subordinate’s
commit or abort status is irrelevant.
3. If all the subtransactions, including the subtransaction at the coordinator site, send a reader message,
we don’t need the termination phase. Indeed, we can simply remove the transaction from the transaction
table, without writing any log records at any site for this transaction.
Worked example (message counts): ★With or without presumed abort, if all subordinates vote yes there is the normal exchange of prepare/yes/commit/ack: 4(N-1) messages. If any subordinate says no, plain 2PC still replies abort but need not wait for acks: 3(N-1); with presumed abort, the abort/ack exchange is skipped entirely: 2(N-1). If the coordinator notifies all subordinates of the final decision, the answers are 4(N-1) and 3(N-1). ★Scenarios: ① The coordinator fails after sending the prepare messages. ② A subordinate fails before receiving a prepare message. ③ Assuming 2PC with presumed abort: the coordinator aborts a transaction, but the abort message to a subordinate is lost. Outcomes: ① All the subordinates are blocked. ② The coordinator times out waiting for the failed subordinate and the transaction aborts. ③ The coordinator removes the transaction from the transaction table. The subordinate that didn't receive the abort message retries sending its yes or no vote; the coordinator doesn't find the transaction in the transaction table and replies abort.
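A minimal sketch encoding the message counts from this example, for a coordinator plus N-1 subordinates (assumes no-voters are not notified of the outcome):

def messages_2pc(n, all_yes, presumed_abort=False):
    if all_yes:                # prepare + yes + commit + ack rounds
        return 4 * (n - 1)
    if presumed_abort:         # prepare + votes only: no abort/ack round
        return 2 * (n - 1)
    return 3 * (n - 1)         # prepare + votes + abort; no acks needed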
Replication: ★Synchronous: All copies of a modified relation must be updated before the modifying Xact commits. ① Voting: write a majority of copies; read enough copies to be sure of seeing at least one most recent copy. E.g., 10 copies; 7 written for an update; 4 copies read. Drawback: reading multiple copies can be resource-intensive and slow. ② Read-Any Write-All: all copies must be written. Drawback: write operations are slower because they only complete once all replicas have been updated. Advantage: a read operation can be performed on any copy, making reads faster because they complete as soon as a single replica responds. ★Cost of Synchronous: Before an update Xact can commit, it must obtain locks on all modified copies. – Sends lock requests to remote sites, and while waiting for the response, holds on to its other locks! – If sites or links fail, the Xact cannot commit until they are back up. – Even if there is no failure, committing must follow an expensive commit protocol with many msgs.
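A minimal sketch of the quorum rule behind the Voting scheme above; the overlap conditions (Qr + Qw > N and 2*Qw > N) are the standard ones guaranteeing that any read sees at least one most recent copy:

def valid_quorums(n, write_q, read_q):
    # Read/write quorums must overlap, and any two write quorums must overlap.
    return write_q + read_q > n and 2 * write_q > n

assert valid_quorums(10, 7, 4)   # the 10/7/4 example from the notes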
★Asynchronous Replication: Copies of a modified relation are only periodically updated; different copies
may get out of sync in the meantime. ① Primary Site: Exactly one copy of a relation is designated the
primary or master copy. Replicas at other sites cannot be directly updated. – The primary copy is
published. – Other sites subscribe to this relation; these are secondary copies. ▌Log-Based Capture:
The log (kept for recovery) is used to generate a Change Data Table (CDT). It’s cheaper & faster, but
relies on proprietary log details. ▌Logical log-based capture (middle ground): row-based replication; describes edits at row granularity. ▌Procedural Capture: take a snapshot. ② Peer-to-Peer (multi-leader)
Replication: More than one of the copies of an object can be a master in this approach. Changes to a
master copy must be propagated to other copies somehow. Log-Based Capture plus continuous Apply
minimizes delay in propagating changes. Procedural Capture plus application-driven Apply is the most
flexible way to process changes.
MapReduce Distributed Query Processing: ★Declustering: ① Attribute-less partitioning, eg. random,
Round-Robin ② Single Attribute Schemes, eg. hash de-clustering, range de-clustering
③ Multiple Attributes schemes possible, eg. MAGIC, BERD
★Hash Declustering: e.g., salary % 3: tuples with remainder 0, 1, 2 go to three different nodes. – Selections with equality predicates referencing the partitioning attribute are directed to a single node: salary = 60K. – Equality predicates referencing a non-partitioning attribute and range predicates are directed to all nodes: age = 20, salary < 20K.
★Range Declustering: e.g., salary ranges 0-50K, 51K-100K, and 101K-∞ go to three different nodes. – Equality and range predicates referencing the partitioning attribute are directed to a subset of nodes: salary = 60K, salary < 20K (both are directed to one node). – Predicates referencing a non-partitioning attribute are directed to all nodes.
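A minimal Python sketch contrasting the two routing schemes on the salary examples above (node counts and range boundaries are illustrative):

def hash_node(salary, n_nodes=3):
    # Hash declustering: equality on the partitioning attribute -> one node.
    return salary % n_nodes

def range_node(salary, bounds=(50_000, 100_000)):
    # Range declustering: 0-50K -> node 0, 51K-100K -> node 1, 101K+ -> node 2.
    for node, upper in enumerate(bounds):
        if salary <= upper:
            return node
    return len(bounds)

# salary = 60K: both schemes route the query to a single node.
# salary < 20K: hash must consult all nodes; range only consults node 0.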
★Declustering Tradeoffs: • When selectivity is low, Range spreads out the load and is ideal. • When selectivity increases, Range puts a heavy workload on one/few nodes while Hash spreads out the load.
★Distributed Join: • Partition inputs to buckets; each bucket fits in join processors’ aggregate memory •
Partition and join each bucket pair across join processors
★Distributed Aggregation: Step 1: compute aggregate locally for each node. Step 2: redistribute by
hashing group attribute and aggregate partial results.
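A minimal Python sketch of the two steps for a SUM per group; the modeling of nodes as lists of (group, value) pairs is illustrative:

from collections import defaultdict

def distributed_sum(nodes, n_nodes):
    # Step 1: pre-aggregate locally on each node.
    local = []
    for rows in nodes:
        acc = defaultdict(int)
        for g, v in rows:
            acc[g] += v
        local.append(acc)
    # Step 2: redistribute partial results by hashing the group attribute,
    # then combine each group's partials at its target node.
    final = [defaultdict(int) for _ in range(n_nodes)]
    for acc in local:
        for g, v in acc.items():
            final[hash(g) % n_nodes][g] += v
    return final

# distributed_sum([[("a", 1), ("b", 2)], [("a", 3)]], n_nodes=2)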
★Summary: • Distributed query execution is built on top of partitioning • Declustering can reduce
selection costs but can also lead to load skew • Partitioned operators focus on both parallelism and
memory requirements.
MapReduce: Distributed computation is usually simple, but requires complex code to handle distribution, fault tolerance, scheduling, and partitioning. The MapReduce approach: write code using a functional model and hide the complexity behind a library.
Map(key, value): (key, value) → list((key, value)). Reduce(key, list(values)): (key, list(values)) →
(key, value)
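The classic word-count instance of these signatures, as a local Python sketch (a real MapReduce library handles the sharding, shuffling, and scheduling):

def map_fn(key, value):
    # key: document id, value: document text.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: a word, values: the list of counts emitted for that word.
    return (key, sum(values))

# The framework groups map outputs by key ("shuffle") before reducing:
# reduce_fn("hello", [1, 1, 1]) -> ("hello", 3)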
Drawbacks: extensive I/O, i.e., intermediate results are saved as regular files in the DFS (takes a lot of time); shuffling consumes network bandwidth and causes delay (progressive shuffling can start reducers earlier); limited expressiveness (the resulting coding style is clumsy).