[Paper Reading] QAGen: Generating query-aware test databases

PingCAP-TiDB, Nov 05, 2021

About This Presentation

Today, a common methodology for testing a database management system (DBMS) is to generate a set of test databases and then execute queries on top of them. However, for DBMS testing, it would be a big advantage if we could control the input and/or the output (e.g., the cardinality) of each individual ...


Slide Content

Query-Aware Database Generators

Background and Demands

How to simulate a client's business workload?
Ideal circumstance: Database Schema + Real Data + SQL
Sometimes real data is not available. Alternative: Database Schema + SQL + (Statistics?)
Mock distributions/domains for an attribute:
- pk -> uniform distribution
- pk + fk -> integrity constraint
- Zipfian
- Enum type (very low cardinality)
Some SQL is too complex to understand: TPC… SSB… heavy analytical tasks.
We need a query-aware generator. (A minimal mocking sketch follows.)
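To make the "mock distribution" alternative concrete, here is a minimal Python sketch; the function names, parameters, and default values are illustrative assumptions, not from the slides or either paper.

```python
import random

# Minimal sketch of the "mock distribution" approach listed above.
# All names and defaults here are illustrative, not from the paper.

def gen_pk(n):
    """Primary keys: unique values, uniformly spread over [1, n]."""
    return list(range(1, n + 1))

def gen_fk(pks, n):
    """Foreign keys: sampled from existing pks, so the integrity
    constraint (every fk references a real pk) holds by construction."""
    return [random.choice(pks) for _ in range(n)]

def gen_zipf(n, alpha=1.5, domain=1000):
    """Skewed attribute values following a truncated Zipfian distribution."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, domain + 1)]
    return random.choices(range(1, domain + 1), weights=weights, k=n)

def gen_enum(n, values=("red", "green", "blue")):
    """Low-cardinality enum-typed attribute."""
    return [random.choice(values) for _ in range(n)]
```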

Hassles of Making a Generator
For a simple query, the solution is simple and direct:
  select [cols...] from t1 join t2 on t1.pk = t2.fk
What if we need to simulate some correlation:
  where a < ‘xx’ and b > ‘yy’
Or process a complex projection:
  select case when cond1 then ***, case when cond2
And combinations of the above circumstances…. (See the sketch below.)
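Why correlation is a hassle: a small Python illustration (numbers and column domains are invented for the example). Generating each column independently almost never hits a target selectivity for a conjunctive predicate; satisfying rows have to be planted deliberately.

```python
import random

# Illustrative only: suppose we want 10% of rows to satisfy
# (a < 50 AND b > 950) over a uniform domain [0, 1000). Independent
# per-column generation yields roughly 0.05 * 0.05 = 0.25% of rows,
# far from the target. A query-aware generator plants satisfying
# rows explicitly.

def gen_rows(n, target_fraction=0.10):
    rows = []
    for i in range(n):
        if i < n * target_fraction:
            # Explicitly construct a row satisfying both predicates.
            rows.append((random.randrange(0, 50), random.randrange(951, 1000)))
        else:
            # Remaining rows fail both predicates.
            rows.append((random.randrange(50, 1000), random.randrange(0, 951)))
    return rows
```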

Two papers with two different methods:
- QAGen
- DGL

QAGen: Generating query-aware test databases

Symbolic Execution
- Volcano-model iterators
- Data are symbols instead of real values
- Constraints are generated during execution
A minimal sketch of the idea follows.
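Below is a hedged Python sketch of the idea, not QAGen's implementation; all class and method names are hypothetical. Tuples carry symbolic variables, and operators record constraints over those symbols instead of evaluating real data.

```python
# Sketch of the leaf of a symbolic Volcano-style iterator tree.

class Symbol:
    """A placeholder for an attribute value; no concrete value yet."""
    def __init__(self, name):
        self.name = name

class SymbolicScan:
    """Leaf operator: emits tuples of fresh symbols instead of real data."""
    def __init__(self, table, columns, cardinality):
        self.tuples = [
            {c: Symbol(f"{table}.{c}#{i}") for c in columns}
            for i in range(cardinality)
        ]

    def get_next(self):
        return self.tuples.pop(0) if self.tuples else None

scan = SymbolicScan("t1", ["pk", "a"], cardinality=3)
t = scan.get_next()   # a dict of symbols: t1.pk#0, t1.a#0
```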

Agenda of QAGen
1. Analyze the query and find the available "knobs". A knob of an operator is the cardinality of the operator's output, or the distribution if the input is pre-grouped.
2. Do the symbolic execution (and obtain the constraints on the data).
3. Instantiate the data. Always comes at an expensive cost.
A hypothetical knob annotation is sketched below.
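For illustration, knob annotation for a plan might look like the following; the operator names and numbers are invented, not QAGen's notation.

```python
# Hypothetical knob annotation for the plan of
#   SELECT ... FROM t1 JOIN t2 ON t1.pk = t2.fk WHERE t2.b > 'yy'
# Each operator is assigned a target output cardinality (its "knob").
knobs = {
    "scan(t1)":     {"cardinality": 10_000},
    "scan(t2)":     {"cardinality": 50_000},
    "select(t2.b)": {"cardinality": 5_000},   # selectivity knob
    "join(pk=fk)":  {"cardinality": 5_000},
}
```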

Analyze the Query

Symbolic Execution - Selection
GetNext on the child reads a tuple t.
[Positive Tuple Annotation] If the output has not reached the cardinality knob c: for each symbol s in t, insert <s, p>, then return t to the parent.
[Negative Tuple Processing] Otherwise, insert <s, !p>.
(A sketch of this operator follows.)
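A hedged Python sketch of the selection operator under the assumptions above: the first c tuples are annotated to satisfy the predicate p and passed up; the rest are annotated with the negation and dropped. Class and field names are hypothetical.

```python
class SymbolicSelect:
    def __init__(self, child, predicate, cardinality_knob):
        self.child = child          # child iterator (e.g., a SymbolicScan)
        self.p = predicate          # predicate as a constraint template
        self.c = cardinality_knob   # desired output cardinality
        self.emitted = 0
        self.constraints = []       # collected <symbol, constraint> pairs

    def get_next(self):
        while (t := self.child.get_next()) is not None:
            if self.emitted < self.c:
                # Positive tuple: every symbol must satisfy p.
                for s in t.values():
                    self.constraints.append((s, self.p))
                self.emitted += 1
                return t
            # Negative tuple: annotate with !p, do not return to parent.
            for s in t.values():
                self.constraints.append((s, f"NOT({self.p})"))
        return None
```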

Symbolic Execution - Join
Generate the join distribution.
GetNext on the (outer) child reads a tuple t.
[Positive Tuple Annotation] Replace the fk symbol with the matching pk symbol.
[Negative Tuple Processing]
(A sketch of this operator follows.)
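A hedged sketch of the positive path of a symbolic equi-join on t1.pk = t2.fk; the join distribution decides each pk tuple's fan-out, and a joining fk symbol is unified with (replaced by) its pk symbol. Column keys "pk"/"fk" and all names are assumptions for the example.

```python
class SymbolicJoin:
    def __init__(self, outer, inner_tuples, distribution):
        self.outer = outer         # outer child iterator (pk side)
        self.inner = inner_tuples  # materialized inner tuples (fk side)
        self.dist = distribution   # dist[i] = fan-out of the i-th pk tuple
        self.i = 0

    def get_next_batch(self):
        t = self.outer.get_next()
        if t is None:
            return None
        fanout, self.i = self.dist[self.i], self.i + 1
        joined = []
        for _ in range(fanout):
            inner_t = self.inner.pop(0)
            inner_t["fk"] = t["pk"]   # unify: fk symbol := pk symbol
            joined.append({**t, **inner_t})
        return joined
```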

Instantiating
A typical model-checking (constraint-solving) process.
Low efficiency.
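To illustrate the instantiation step only, here is a minimal sketch using the Z3 SMT solver; QAGen's actual solver machinery differs, and the symbols and predicate reuse the earlier invented example.

```python
# Turn collected symbolic constraints into concrete values with Z3.
from z3 import Int, Or, Solver, sat

a1, b1 = Int("t.a#1"), Int("t.b#1")   # symbols from a positive tuple
a2, b2 = Int("t.a#2"), Int("t.b#2")   # symbols from a negative tuple

s = Solver()
s.add(a1 < 50, b1 > 950)              # <s, p>: the tuple must satisfy p
s.add(Or(a2 >= 50, b2 <= 950))        # <s, !p>: the tuple must violate p
if s.check() == sat:
    m = s.model()
    # Concrete values to write into the generated test database.
    print(m[a1], m[b1], m[a2], m[b2])
```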

Do we need a state-of-the-art generator?
Pros:
- Without meddling, we can definitely get effective data.
- Works for complex queries that are not understandable.
- Perfectly guarantees constraints: cardinality constraints, integrity constraints.
Cons:
- Low efficiency.
- Some cases still can't be processed.
How about introducing an intermediate DSL?

DGL: Flexible Database Generators

Another alternative
What if we have understood the workload pretty well? Then use a more flexible DSL instead of coding for every workload.
DGL (Data Generation Language):
- Scalars
- Rows
- Iterators
- Distributions
- Real SQL (query / persist)
- Tables
- Expressions / functions
(A sketch of the idea follows.)
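This is not DGL's actual syntax; below is a hedged Python sketch of the same idea, composing small building blocks (distributions, iterators, row constructors) to describe a table's data instead of hand-coding a generator per workload.

```python
import random

# DGL-style composition sketched in Python (names are illustrative).

def uniform(lo, hi):
    """Endless stream of uniformly distributed integers."""
    while True:
        yield random.randint(lo, hi)

def zipf(domain, alpha=1.5):
    """Endless stream of Zipfian-skewed values over [1, domain]."""
    weights = [1.0 / (r ** alpha) for r in range(1, domain + 1)]
    while True:
        yield random.choices(range(1, domain + 1), weights=weights)[0]

def table(n, **columns):
    """Build n rows by drawing one value per column generator."""
    return [{name: next(gen) for name, gen in columns.items()}
            for _ in range(n)]

rows = table(1000,
             pk=iter(range(1, 1001)),   # sequential primary key
             price=uniform(1, 100),     # uniform attribute
             category=zipf(20))         # skewed, low-cardinality attribute
```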

An example of TPC-H

Thanks