data_management_and _data_science_cheat_sheet.pdf

JiananWang21 87 views 2 slides May 05, 2024
Slide 1
Slide 1 of 2
Slide 1
1
Slide 2
2

About This Presentation

This is my cheat sheet for the course Data Management.


Slide Content

Storage Hierarchies and Data Layout Storage Technologies: ★CPU registers < CPU caches < DRAM
< SSD < HDD < Network Storage 蓝volatile黑non-volatile
★Average access time (AAT) = Hit time + ((miss rate) X (miss penalty)) Hit timecache < Hit timememory →
AATcache << AATmemory
★Non-volatile Memory (NVM): Eg. flash(SDD), ROM, FRAM. Goals: data persists after even when
power is turned off + reduce random/sequential access gap further than SSD, providing uniform access
times similar to DRAM. ① Like DRAM, low-latency loads and stores. ② Like SSD, persistent writes and
high density.
除了断电不丢,和没有随机 /顺序获取的 差距,还有以下特点 :
• No seek/rotational delays (These are characteristics of HDDs. Seek delay refers to the time it takes
for the read/write head to move to the correct track on the platter. Rotational delay is the time waiting
for the platter to rotate to the correct position under the read/write head.) • Byte-addressable (because
it allows data to be read and written at the level of individual bytes, much like DRAM) • High read/write
throughput
★• SSD uses non-volatile flash chips. Package multiple flash chips into a single closure. • SSD
controller: Embedded processor that executes firmware-level software. Bridges Flash chips to the SSD
input/output interfaces (like SATA or NVMe).• Block-addressable. (This means when data needs to be
updated, the entire block containing that data must be read, modified, and written back, even if only a
small amount of data within the block is changed.与Byte-addressable不同)
File Storage: The
DBMS stores a
database as one or
more files on disk.
★Alternative File
Organizations: •
Heap files: Best
when typical access
is a full file scan •
Sorted Files: Best for
retrieval in an order,
or for retrieving a
‘range’ • Log-
structured Files: Best
for very fast
insertions/deletions/updates
① As file grows and shrinks, disk pages are allocated and de-allocated. – contains records in no
particular order – Need to be able to scan, search based on rid . – Need to manage free space, 预防
fragmented space
★Heap file implemented using lists: • One header page which contains links to full data pages and data
pages that have some free space. <Heap file name, header page id> stored somewhere • Each page
contains 2 ‘pointers’ plus data. 缺: Perf. Everytime you wanna find/write sth, you need to traverse many
pages. This directory can quickly direct you to the specific page you need without traversing.
★Heap File Using a Page Directory: directory里边有data page的index和meta data, 比如page的剩余
空间. Directory structure requires less space than maintaining a linked list for all pages.
★Log-structured files: Instead of storing tuples in
pages, the DBMS only appends log records. Blocks are
never modified. – DBMS scans log backwards, and
“recreates” the tuple – Build indexes to allow jumps in
the log – Periodically cleanup the log
• Pros – High performance for inserts, deletes and
updates – Ultra-fast recovery from failures – Good for
SSD as writes are naturally leveled (每个SDD的mem
cell被写的次数是有限的 , data is written sequentially to
the log, spreading the write operations evenly across all
the cells.)
• Cons: – Unpredictable performance in sequential
reads – Need a lot of free space – Affects garbage collection (need for compaction) – Data can be lost if
written but not checkpointed (a process that periodically takes a snapshot of the database)
Buffer Management: DBMS needs to do things “its own way”: ① prefetching ② Control over buffer
replacement policy ③ Control over thread/process scheduling: Manage concurrent operations to avoid
the convoy problem, 因为等别的任务释放锁所造成的延迟 . ④ Control over flushing data (memory) to
disk: Write-Ahead Logging (WAL) protocol requires that changes (log entries) are saved to disk before
the actual data pages are written.
★Data must be in RAM for DBMS to operate on it! When a page is requested: ⑩ Buffer pool
information table contains: <frame#, pageid (on disk), pin_count (be accessed 就+1), dirty (被改了但还
没写回disk)> ② If requested page is not in pool: – Choose a frame for replacement (only un-pinned
pages are candidates) – If frame is “dirty”, write it to disk – Read requested page into chosen frame ③
Pin the page and return its address in mem.
★Least Recently Used (LRU) vs. Most Recently Used (MRU): Problem: # buffer frames < # pages in file
means each page request causes an I/O. MRU much better in this situation (Sequential flooding)
★“Clock” Replacement Policy:
Page Layout, NSM (row store): ★The N-ary Storage Model: • Page = collection of slots
• Each slot stores one record ① Record identifier: <page_id, slot_number> ② Option 2: <uniq> →
<page_id, slot_number>
★Page formate, Fixed-Length: • Schema is stored in system
catalog (# of data, # of colnms, type of each colmn) – # of fields
(clmns) is fixed for all records of a table – The type of data each
field can hold is consistently • Each field has fixed length
• Record id = <page id, slot #>
• Packed case, insert因为每个slot的大小固
定且已知 , 可以算出 free space开始的地方,
所以就直接加在 free spce且# of
recordes(N)+1; delete slot2, moving records
填补空缺需要 changes rid; maybe
unacceptable.所以这种方法对 insert好, 对
delete不好.
• Unpacked case, if we have M slots, we
need M bits. Delete just has to click that bit
from the corresponding bit from 1 to 0.
★Page formate, Variable-Length (# fields is fixed):
Fields Delimited by Special Symbols: No matter the
length of each field, the special symbol makes it
clear where one field ends and the next begins.
Have to scan through the record to find fields.
Array of Field Offsets: offers direct access to i’th
field, efficient storage of nulls (the offset for that field
= the offset of the next field证明null field. 而第一种
你还是要插入 $, 如果两个$$之间没有东西证明 null field); small directory overhead.
• Need to move records in a page – Allocation/deletion must find/release free space
• Maintain slot directory with <record offset, record length> pairs – Records can move on page without
changing rid – Useful for freely moving fixed-length records (ex: sorting)
缺: • If a field grows and no longer fits? – shift all subsequent fields • If
record no longer fits in page? – Move a record to another page after
modification • What if record size > page size? – Limit allowed record
size
Page Layout, DSM (column store): ★Decompose a relational table to
sub-tables per attribute. Example: • Columns stored in pages –
Denoted with different colors
• Each column can be accessed individually – Pages loaded only for
the desired attributes
★Pros: ① Saves IO by bringing only the relevant attributes ② Each
column contains only one type of data, it can be compressed more
effectively than a row with multiple data types.
Cons: ① Writes more expensive (require changes across multiple
columns rather than just writing to a single row) ② Need tuple
stitching at some point (Tuple Reconstruction) 为了得到一个 row ③
Indexed selection with low selectivities (still results in a large portion
of the column being read) ④ Queries that require all or most of the
attributes
★Compression: ① RLE: "AAAABBBCC" would be stored as "A4B3C2” on the page. 如果想要找 某一id
对应的值需要 decompress. ② Bit-vector encoding:– Useful when we have categorical data & Useful
when a few distinct values – One bit vector for each distinct value – Vector length = # distinct elements
③ Dictionary encoding – Replace long values (e.g., strings) with integers ④ Frequency partitioning –
Reorganize each column to reduce entropy at each page. Smaller dictionaries improve - memory
requirements (不需要for整个列的大字典了 ) - cache utilization (Smaller dictionaries are more likely to fit
into caches) - effectiveness of run-length encoding (想像所有 100个A都在一个 page里,你直接存
A100一次;如果散落 , 你可能要存 A20一个page里, A80另一个page里)
★Write-optimized storage: to make write operations more efficient in columnar storage systems. Write
rows in-memory, periodically flush columns to disk
Page Layout, PAX (Partition Attributes Across, hybrid):
Query Execution & Optimization 左是outer relation右是inner relation
Late Materialization: In the context of column-oriented databases, this means working with column
values for as long as possible before combining them to form the complete dataset required for the
query's output.
Relational Algebra Equivalences:














★I/O cost example:
#Tps/pg = floor(pg size / tp size)
# pgs, I/O pgs, # seeks → celling
I/O Pages - BNL
(1) A + A/buffer[A] * B
(2) 如果A已经在内存里了 , 那只需要 A/buffer[A] * B
Seeks – BNL
(1) 2 * A/buffer[A]
(2) 如果A已经在内存里了 , A/buffer[A]
I/O Pages - Merge
(1) A + B
(2) 如果A已经在内存里了 , 那只需要 B
Seeks - Merge
(1) A/buffer[A] + B/buffer[B]
(2) 如果A已经在内存里了 , 要seek B, 那只需要 seek一次
Heuristic-based optimization: Define static rules that transform logical operators to a physical plan ①
Perform most restrictive selections early ② Perform all selections before joins ③
Predicate/Limit/Projection pushdowns ④ Join ordering based on cardinality
★INGRES optimizer:优: ① Easy to implement and debug. ② Works reasonably well and is fast for
simple queries & small tables. 缺: ① Doesn’t truly handle joins. ② Join ordering based only on
cardinalities (i.e., the number of rows). ③ Naïve, nearly impossible to generate good plans when
operators have complex interdependencies.
Heuristics + cost-based optimizer: Use static rules to perform initial optimization. Then use dynamic
programming to determine the best join order for tables. 优: ① Usually finds a reasonable plan without
having to perform an exhaustive search. ② Order of results also considered! 缺: Depends on heuristics:
① Left-deep join trees are not always optimal ② Space exploration(i.e.找最佳join order) only considers
joins.
★SYSTEM R Optimizer (Both heuristics and cost. Efficient and usually derives reasonable plans. Relies
on Principle of optimality & Interesting orders.): Step 1: Break query up into blocks and generate the
logical operators for each block. Block: No nested queries, exactly one SELECT & FROM, and at most
one WHERE, GROUP BY, HAVING. Step 2: For each individual block: ① For each logical operator,
consider a set of physical operators & offered access paths. ② Iteratively construct a “left-deep” tree that
minimizes the estimated amount of work to execute the plan.
★Principle of optimality: Choose the cheapest path to create A⨝B⨝D, and consider it fixed for higher
levels! 缺: May lead to suboptimal plans, 没来考OrderBy, avoided by sort merge join. • Relaxed
principle of optimality – Consider order-by clause! A plan is compared with all other plans that produce
the same order.

Cost & selectivity estimation: ★Page-
oriented Nested Loop join: I/O Cost:
#pages of R + #pages of R * #pages of
S – Choose order of R, S, so that
#pages of R < #pages of S
★Selectivity estimates: Crude
estimation假设均匀分布 : ① Selectivity
= 1 / #distinctValue,没有告诉信息的
话就用1/10 ② Estimated #results =
#records / #distinctValues ③ Range
queries: range size / domain size(i.e.
#distinctValues) ④ Free if there is an
index!
★Join cardinality estimates: Cardinality
estimate gives an estimate of how
many rows will result from the join.
Tansaction & Distributed Transaction
★Transactions are the basic unit of
change in the DBMS. No partial txns are
allowed.
Transaction Schedules:
• Serial schedule:
Schedule that does not
interleave the actions of
different transactions. •
Serializable schedule:
A schedule that is equivalent to some serial execution of the transactions. • If each transaction
preserves consistency, every serializable schedule preserves consistency.
★Conflict Serializable Schedules: Conflict equivalent: ① They involve the same actions. ② Every pair of
conflicting actions is ordered the same way. • Schedule S is conflict serializable if S is conflict equivalent
to some serial schedule.
★Precedence Graph: One node per txn; edge from Ti to Tj if Tj reads/writes an object last read/written
by Ti. 冲突对指向时间后者. A schedule is conflict serializable iff its dependency
graph is acyclic. CS一定是serializable, 但serializable不一定是 CS eg. T1T2T3和
T2T3T1一样,都会最终被 T3 overwritten. 没圈=>CS.
Two-phase locking (2PL) concurrency control: A transaction is divided into two
phases: the growing phase, where it acquires all the locks it needs and does not release any, and the
shrinking phase, where it releases locks and cannot acquire any new ones. This rule prevents a
transaction from acquiring new locks after it has started releasing locks. • 2PL allows only schedules
whose precedence graph is acyclic => serializable.
Strict two-phase locking (2PL) concurrency control: All locks held by a transaction are retained until the
transaction commits or aborts. The main difference from basic 2PL is that in basic 2PL, locks can be
released during the shrinking phase before the transaction completes, while Strict 2PL holds all locks
until the very end. This approach ensures that other transactions cannot read or write the data affected
by the uncommitted transaction. It simplifies transaction aborts.
★2PL 缺: – Lock management overhead. – Deadlock detection/resolution. – Lock contention竞争 for
heavily used objects.
Deadlocks: ★Detection, Waits-for graph: – Nodes are transactions – Edge from Ti to Tj if Ti is waiting
for Tj to release a lock. 等的指向用的 .
★Prevention: • Assign priorities based on timestamps. – Earlier timestamp, higher priority • Assume
Ti wants a lock that Tj holds. Two policies: ① Wait-Die: It Ti has higher priority, Ti waits for Tj. Otherwise
Ti aborts.高等, 低自杀. ②Wound-wait: If Ti has higher priority, Tj aborts. Otherwise Ti waits.高杀对方, 低
等待.
Fixed vs dynamic dataset: ★Fixed database: – Set of tuples is fixed – Can update, but no inserts/deletes
– Can lock all related tuples ★Dynamic databases: – Can insert/delete tuples – Cannot lock all related
tuples • If insertions/deletions are allowed, then even Strict 2PL cannot assure serializability • Conflict
serializability guarantees serializability only if the set of objects is fixed!
★Predicate locking: aims to solve the problems of serializability in dynamic databases by locking not just
specific records, but all records that satisfy a certain predicate, both existing and those that might be
added in the future. 缺: Expensive
★Index locking: Locking database indices that correspond to the predicate. This approach can be
effective but may lock out more records than necessary, especially if the index covers a broad range of
values. More efficient than predicate locking.
Pessimistic vs Optimistic: • Two-phase locking (2PL) – Pessimistic approach – Assume txns will conflict
– Acquire locks on all items before accessing them!
• Timestamp ordering (T/O) – Optimistic approach – Assume that conflicts are rare – Do not acquire
locks, check for conflicts at commit
Optimistic control: The Kung-Robinson Model: Txns have three phases: ① READ: txns read from the
database, but make changes to private copies of objects. ② VALIDATE: Check for conflicts. ④ WRITE:
Make local copies of changes public.
★Validation: For validation, each transaction is given a
unique identifier, typically a timestamp, which is assigned
at the end of its read phase, just before validation begins.
Test conditions that are sufficient to ensure that no
conflict occurred. ① For all i and j such that Ti < Tj ,
check that Ti completes before Tj begins. ② For all i
and j such that Ti < Tj, check that: • Ti completes
before Tj begins its Write phase • WriteSet(Ti) ∩
ReadSet(Tj) is empty. ③ For all i and j such that Ti <
Tj, check that: • Ti completes read phase before Tj
does • WriteSet(Ti) ∩ ReadSet(Tj) is empty •
WriteSet(Ti) ∩ WriteSet(Tj) is empty. 二三为了检查 Does Tj read dirty data? Does Ti overwrite Tj’s
writes? 如果1过了, 就证明valid, 不用往下走了 . 如果1没过, 要进行第二步 , WriteSet(Ti前辈) ∩
ReadSet(Tj后来者) 是空则valid, 否则not valid重启Tj.
Assignment of txn id, validation, and the Write phase are inside a critical section! – Nothing else goes on
concurrently. – So, no need to check for Test 3 --- can’t happen.
★Overheads in Optimistic CC: ① Record read/write activity in ReadSet and WriteSet per txn. – Must
create and destroy these sets as needed. ② Check for conflicts during validation, and make validated
writes “global”. – Critical section can reduce concurrency.
③ Optimistic CC restarts txns that fail validation. – Work done so far is wasted. –Requires clean-up.
Parallel architectures: Shared
Memory is easy to use and
program, but very expensive to
scale. Shared Nothing is the
most scalable architecture
(lowest contention) but most
difficult to administer and tune.
Two-phase commit (2PC)
protocol: is an ACID-compliant
protocol
1. The coordinator sends a prepare message to each subordinate.
2. When a subordinate receives a prepare message. It force-writes an abort or prepare log record, and
then sends a no or yes message to the coordinator. Notice that a prepare log record is not used in a
centralized DBMS; it is unique to the distributed commit protocol.
3. If the coordinator receives yes messages from all subordinates, it force-writes a commit log record
and then sends a commit message to all subordinates. If it receives even one no message, or does not
receive any response from some subordinate for a specified time-out interval, it force-writes an abort log
record, and then sends an abort message to all subordinates (这一点取决于题 目, wherether nodes that
voted no are notified of the final decision, 如果不是就不用发 abort给 subordinates).
4. When a subordinate receives an abort message, it force-writes an abort log record, need not send an
ack message to the coordinator, aborts the subtransaction. When a subordinate receives a commit
messag, it force-writes a commit log record, sends an ack message to the coordinator, and commits the
subtransaction.
5. After the coordinator has received ack messages from all subordinates, it writes an end log record for
the transaction.
1-2 voting phase, 3-5 termination/complete/commit phase, both initiated by the coordinator. The basic
principle is that any of the transaction managers (领导+从属) involved can unilaterally abort a
transaction, whereas there must be unanimity一致同意 to commit a transaction.发送者always会在消息
发送前log decision to stable storage. A transaction is officially committed at the time the coordinator’s
commit log record reaches stable storage. Subsequent failures cannot affect the outcome of the
transaction.
★Presumed commit(左) & abort(右):
Restart After a Failure at a Site:
① If we have a commit/abort log rec for Xact T, but not an end rec, must redo/undo T.
– If coordinator, keep sending commit/abort msgs to subs until acks received. If subordinate, 直接
redo/undo ② If we have a prepare log rec for Xact T, but not commit/abort, this site is a subordinate for
T. – Repeatedly contact the coordinator to find status of T, then write commit/abort log rec; redo/undo T;
and write end log rec. ③ If we don’t have even a prepare log rec for T. If subordinate, unilaterally abort
and undo T. If coordinator, subordinates may send messages later.
★Coordinator failures & Blocking: ① If coordinator fails after sending prepare msgs but before writing
commit/abort log recs, when it comes back up it aborts the Xact. ② If coordinator for Xact T fails,
subordinates who have voted yes cannot decide whether to commit or abort T until coordinator recovers.
– T is blocked. – Even if all subordinates know each other, they are blocked unless one of them voted
no.
★Link and Remote Site Failures: If a remote site does not respond for Xact T, either because the site
failed or the link failed: ① If the current site is the coordinator for T, should abort T. ② If the current site
is a subordinate, and has not yet voted yes, it should abort T. ③ If the current site is a subordinate and
has voted yes, it is blocked until the coordinator responds.
★Refine: 1. If a subtransaction has not changed the database (which can be easily detected by
keeping a count of update log records), the subordinate can respond to a prepare message from the
coordinator with a reader message, instead of yes or no. The subordinate writes no log records in this
case.
2. When a coordinator receives a reader message, it treats the message as a yes vote, but with the
optimization that it does not send any more messages to the subordinate, because the subordinate’s
commit or abort status is irrelevant.
3. If all the subtransactions, including the subtransaction at the coordinator site, send a reader message,
we don’t need the termination phase. Indeed, we can simply remove the transaction from the transaction
table, without writing any log records at any site for this transaction.
例题: ★无论是是否 presumed aborts, yes都要正常来往 commit和ack, 4(N-1). 有任何一个说 no, 2PC正
常回复abort, 但不用等 ack, 3(N-1); presumes abort, 不用往来 abort和ack, 2(N-1); 如果说coordinator
notifies all subordinates of the final decision, 那答案就是 4(N-1)和3(N-1). ★① The coordinator fails after
sending the prepare messages. ② A subordinate fails before receiving a prepare message. ③
Assuming 2PC with Presumed Aborts: The coordinator aborts a transaction, but the abort message to a
subordinate is lost. ① All the subordinates are blocked. ② There is a timeout in the coordinator waiting
the failed subordinate and the transaction aborts. ③ The coordinator removes the transaction from the
transaction table. The subordinate that didn’t receive the abort message, re-tries to send yes or no. The
coordinator doesn’t find the transaction in the transaction table and replies abort.
Replication: ★Synchronous: All copies of a modified relation must be updated before the modifying Xact
commits. ① Voting: write a majority of copies; read enough copies to be sure of seeing at least one most
recent copy. Eg. 10 copies; 7 written for update; 4 copies read. 缺: Read multiple copies can be
resource-intensive and slow. ② Read-Any Write-All: All copies must be writte. 缺: write operations
slower because they only complete once all replicas have been updated. 优: A read operation can be
performed on any copy, making read operations faster because they can be completed as soon as a
single replica responds. ★Cost of Synchronous: Before an update Xact can commit, it must obtain locks
on all modified copies. – Sends lock requests to remote sites, and while waiting for the response, holds
on to other locks! – If sites or links fail, Xact cannot commit until they are back up. – Even if there is no
failure, committing must follow an expensive commit protocol with many msgs.
★Asynchronous Replication: Copies of a modified relation are only periodically updated; different copies
may get out of sync in the meantime. ① Primary Site: Exactly one copy of a relation is designated the
primary or master copy. Replicas at other sites cannot be directly updated. – The primary copy is
published. – Other sites subscribe to this relation; these are secondary copies. ▌Log-Based Capture:
The log (kept for recovery) is used to generate a Change Data Table (CDT). It’s cheaper & faster, but
relies on proprietary log details. ▌Logical log-based capture(Middle-ground): Row-based replication:
Describe edits at row granularity. ▌Procedural Capture: Take a snapshot. ② Peer-to-Peer (multi-leader)
Replication: More than one of the copies of an object can be a master in this approach. Changes to a
master copy must be propagated to other copies somehow. Log-Based Capture plus continuous Apply
minimizes delay in propagating changes. Procedural Capture plus application-driven Apply is the most
flexible way to process changes.
MapReduce Distributed Query Processing: ★Declustering: ① Attribute-less partitioning, eg. random,
Round-Robin ② Single Attribute Schemes, eg. hash de-clustering, range de-clustering
③ Multiple Attributes schemes possible, eg. MAGIC, BERD
★Hash Declustering: 比如salary%3, 余数是0, 1, 2的分别分到三个 nodes里. – Selections with equality
predicates referencing the partitioning attribute are directed to a single node: salary = 60K. – Equality
predicates referencing a non-partitioning attribute and range predicates are directed to all nodes: age =
20, salary < 20K.
★Range Declustering: 例如salary 0-50K, 51K-100K, 101K-ꝏ分别分到 3个nodes里. – Equality and
range predicates referencing the partitioning attribute are directed to a subset of nodes: salary = 60K,
salary < 20K (both are directed to one node). – Predicates referencing a non-partitioning attribute are
directed to all nodes.
★Declustering Tradeoffs: • When selectivity is low, Range spreads out load and was ideal • When
selectivity is increased, Range causes high workload on one/few nodes while Hash spreads out load.
★Distributed Join: • Partition inputs to buckets; each bucket fits in join processors’ aggregate memory •
Partition and join each bucket pair across join processors
★Distributed Aggregation: Step 1: compute aggregate locally for each node. Step 2: redistribute by
hashing group attribute and aggregate partial results.
★Summary: • Distributed query execution is built on top of partitioning • Declustering can reduce
selection costs but can also lead to load skew • Partitioned operators focus on both parallelism and
memory requirements.
Map reduce: Distributed computation is usually simple but requires complex code to handle distribution,
fault tolerance, scheduling, partitioning. The MapReduce approach: code using functional model, hide
complexity behind a library.
Map(key, value): (key, value) → list((key, value)). Reduce(key, list(values)): (key, list(values)) →
(key, value)
缺: extensive IO ie. intermediate results are saved as regular files in the DFS (takes a lot of
time), shuffling consumes network and causes delay (can do progressive shuffling => start reducers
earlier), limited expressiveness (the way I wrote code is meh).
Tags