A Brief Overview of (Static) Program Query Languages

KimMens 16 views 43 slides Apr 29, 2024
Slide 1
Slide 1 of 44
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44

About This Presentation

A brief introduction to some Program Query Languages and tools, part of a larger course on Programming Paradigms, taught at UCLouvain university in Belgium by Prof. Kim Mens.


Slide Content

Program Query Languages
Kim Mens

April 2024

Definition
A Program Query Language (PQL)

is a language or tool for querying information
from some program representation:
program source code text
binary code
execution traces
abstract syntax trees
control-flow graphs
data-flow graphs
These queries can reason about:
program structure

coding idioms, control flow,

class hierarchies, …
program behaviour

function calls,

variable accesses, …
patterns that indicate
potential issues such as

code smells, vulnerabilities

Purpose impact
program
checking

Maintenance : detect issues in the code that may require refactoring such as bad
smells, duplicated code, deprecated API usage
Testing : identify specifi
Verifi
Debugging : identify problematic code patterns (anti-patterns, known bugs) that
might lead to bugs or vulnerabilities
Teaching : provide personalised feedback to students by identifying issues,
suggesting improvements, or highlight excellence in their code Relevance for Software Quality Assurance

Static vs. Dynamic PQLs
Static Program Query Languages
(SPQL)
reason over a static representation of
software programs
analyse the program without executing it
mostly focus on structural issues (but to
a limited extent behavioural ones too)
to improve code quality, find possible
bugs, detect security vulnerabilities, or
violations of coding standards
Dynamic Program Query Languages
reason over a dynamic program
representation, obtained through code
execution
might not cover all possible execution
traces
mostly focus on behavioural issues that
only manifest at runtime
to monitor system behaviour, memory
usage, and performance issues

(Some) Static PQLs
PQL Program Representation Program Queries
CodeQL
Code Property Graph : a graph database that combines
elements of Abstract Syntax Trees (ASTs), Control Flow
Graphs (CFGs), and data dependency graphs.
SQL-like language to query CodeQL's
Code Property Graph
XPath
XQuery
XML documents
(e.g. srcML for source code)
Queries that navigate the hierarchical
structure of an XML document
Regular
Expressions
Text
(e.g. program source code)
Matching strings based on specific
patterns
AST traversal
Python source code parsed into an AST
(e.g., using Python’s built-in AST module;

or using astroid library for AST parsing,

static analysis and type inference)
Parse tree visitors / walkers
PyLint

(or other “lint” tools)
Predefined rules + possibility to create
your own customized rules (e.g., by
writing your own parse tree visitor)
Pyttern
Pyttern patterns, a Python-like
language with special wildcards
Logic
Metaprogramming
Code as logic facts extracted from the (parse tree)
structure of the source code
Logic queries

Coding Patterns
Coding Idioms
Coding Flaws

Coding Idioms
Conventional way of accomplishing particular programming tasks

Using language-specific features and best practices

to achieve clear, efficient and simple code

Not random patterns but generally recognised and widely adopted

solutions among experienced programmers within the community

Differ from design patterns in their focus on language-specific solutions

to small-scale tasks or problems, rather than architectural issues

Coding Flaws
Broad spectrum of errors, inadequacies, or instances of poor practice
within program code
Not only syntax errors but also more subtle problems such as poor code
smells, problematic coding idioms, anti-patterns, and practices stemming
from misconceptions held by novices
Highlight areas in code that, while not necessarily leading to immediate
execution errors, represent a departure from best coding practices
Signal potential underlying problems in program logic, structure, readability
or maintainability
May indicate a deeper lack of understanding or proficiency in coding
principles and programming language concepts

Accumulator Pattern
Common coding idiom to accumulate
sequence of values into single result
Using for-loop to accumulate results in
a variable as it iterates through the
items
Typical use case is summing numbers
Can be adapted for other operations:
product accumulation,

concatenating a string, …
def accumulate(lst) :
accum = 0
for item in lst :
accum += item
return accum
>>> accumulate([1,2,3,4])
10

Accumulator Pattern
“Write a program that
approximates the value of π using
the Gregory-Leibniz series”
Expected solution makes use of a
variant of the accumulator pattern
Code on the right is flawed (why?)
Writing a pattern that detects the
accumulator pattern can help
identifying solutions that don’t
def approx_pi(n):
var = 0
if n > 0 :
for i in range(n):
var = (-1)**i / (2*i + 1)
var *=4
return var

CodeQL
“Code Query Language”
Formerly known as SemmleQL
Code analysis tool currently maintained by Microsoft's GitHub
Treats code as data by converting the code into database of facts and
relationships
Allows developers to write declarative SQL-like queries over this database
Mostly used for security checks to find potential bugs and vulnerabilities in code

CodeQL
“Find all if-statements in a C++ program that directly contain a return statement”






CodeQL allows to express sophisticated analyses of program behaviour in a way
that is both powerful and “relatively” accessible to those familiar with SQL.
import cpp
from IfStmt ifstmt, ReturnStmt ret
where ifstmt.getThen() = ret.getParent()
select ifstmt, "This 'if' statement directly contains a 'return'."
import CodeQL library to analyse C++ code
declare variables for if- and return-statements
condition to match: return statements
whose parent is body of an if-statement.
defines what the query will return
+ message indicating what the if-statement does

CodeQL accumulator pattern
import python
from Function f,
Name n1,
Name n2,
Name n3,
Assign a,
File file,
Module m,
Return r,
For for,
AugAssign aa,
Variable v
where f.contains(a)
and a.getTarget(0) = n1
and a.getValue() instanceof Num
and n1.getVariable() = v
and m.contains(f)
and m.getFile() = file
and f.contains(for)
and for.contains(aa)
and aa.getTarget() = n2
and n2.getVariable() = v
and f.contains(r)
and r.getValue() = n3
and n3.getVariable() = v
select file, f, v, a, aa, r, "Assignment of a number to a variable"





arguably not the most “readable”

Regular Expressions
A “regex” is a sequence of characters that defines a search pattern
Can be used for text processing and parsing & string searching
Can be used as basic PQL by pattern matching over source code text
Only works on a textual level; no explicit representation of underlying code
structure => more limited analysis capabilities
The re module in Python is a powerful library for string manipulation using
regular expressions

regex example
def append_to(element, to=[]) :

to.append(element)

return to
append_to(5)

append_to(6)

append_to(7)

list = append_to(8)
>>> print(list)

[5, 6, 7, 8] When append_to is called multiple times without specifying
the second argument, it keeps accumulating elements from
all the calls, which is often not the intended behavior.
Practical example of using regex to
analyse Python source code:
Detecting pattern of default arguments in
function definitions.
Common stylistic breach/bug in Python
code, as default arguments retain their
state between function calls, which may
lead to unexpected behavior.

regex example
To detect this pattern, we can write a regex that matches function definitions with default
arguments:
def\s+\w+(.*=\s*(\[\]|\{\}|set())
Let's break down this regular expression:


def\s+\w+( matches start of function definition, including def followed by one or more
whitespace characters, followed by one or more word characters (function name), and an opening
parenthesis.
.*=\s* matches zero or more of any character (non-greedy) followed by an equals sign =,
followed by zero or more whitespace characters. This part looks for assignment of the default
value
(\[\]|\{\}|set()) matches empty list [], empty dictionary {}, or empty set set() as default argument.
def append_to(element, to=[]

regex example
import re
# Sample Python code
python_code = """
def append_to_list(item, lst=[]):
lst.append(item)
return lst
def add_to_set(element, s=set()):
s.add(element)
return s
def add_to_set(): # no default =[] here
s.add(element)
return s
"""
# Regular expression to find mutable default arguments
pattern = r"def\s+\w+(.*=\s*(\[\]|\{\}|set(\
# Find matches in the sample code
matches = re.finditer(pattern, python_code)
# Print matches
for match in matches:
print("Found a function with a mutable default argument:", match.group())
Found a function with a mutable default argument: def append_to_list(item, lst=[]
Found a function with a mutable default argument: def add_to_set(element, s=set()
Found a function with a mutable default argument: def add_to_set(): # no default =[]
Beware of false positives !






Not very readable and still doesn’t match the intended pattern exactly

(Why not? Can you improve it?)
regex accumulator pattern
def\s+(?P<fun_name>.*)\(.*\):
\s*(?P<var>.*)\s*=\s*\d+(?:|.)*
for.*:(?:|.)*
(?P<var>)\s*\+=\s*\d(?:|.)*
return.*(?P<var>).*

def accumulate(lst) :
accum = 0
for item in lst :
accum += item
return accum
>>> accumulate([1,2,3,4])
10

AST
x = x + 1 

print("Hello World")
Module
List
body
Assign
Name
BinOpList
ConstantAddName
0
targets value
leftopright0
Expr
Call
Constant
Name
List
1
value
func args
0
types_ignore
type_comment
x Store x Load
keywords
List
1
id
ctx idctx id
kind
print Store
idctx
"Hello
World"
id
kind

AST Traversal with Visitor DP
Node
FunctionDef Assign AugAssign
For Return …
NodeVisitor
generic_visit(Node)
Accumulator
visit_FunctionDef(FunctionDef)

visit_Assign(Assign)
visit_AugAssign(AugAssign)
visit_For(For)
visit_Return(Return)
...
Hierarchy of AST nodes to be visitedSpecific visitor per pattern to be detected





Well-known design pattern

but requires quite some “boilerplate” code to be written

AST accumulator pattern
import ast
class Accumulator(ast.NodeVisitor):
def __init__(self) -> None:
super().__init__()
self._function_stack = []
self._saved_assign = []
self._in_loop = False
self._has_acc = True
def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
self._function_stack.append([])
self._saved_assign.append([])
self.generic_visit(node)
self._function_stack.pop()
self._saved_assign.pop()
Using Python’s AST module

AST accumulator pattern
def visit_Assign(self, node: ast.Assign) -> None:
for target in node.targets:
if isinstance(target, ast.Name):
if isinstance(node.value, ast.Constant):
self._function_stack[-1].append(target.id)
self.generic_visit(node)
def visit_AugAssign(self, node: ast.AugAssign) -> None:
if self._in_loop:
if node.target.id in self._function_stack[-1]:
self._saved_assign[-1].append(node.target.id)
self.generic_visit(node)

AST accumulator pattern
def visit_For(self, node: ast.For) -> None:
self._visit_loop(node)
def visit_While(self, node: ast.For) -> None:
self._visit_loop(node)
def _visit_loop(self, node) -> None:
if not self._in_loop:
self._in_loop = True
self.generic_visit(node)
self._in_loop = False
else:
self.generic_visit(node)

AST accumulator pattern
def visit_Return(self, node: ast.Return) -> None:
print(self._saved_assign[-1])
if not isinstance(node.value, ast.Name):
self._has_acc = False
elif node.value.id not in self._saved_assign[-1]:
self._has_acc = False
self.generic_visit(node)
with open(“…/buggy-code/source/factor_list.py”) as file:
tree = ast.parse(file.read(), file.name)
visitor = Accumulator()
visitor.visit(tree)
if not visitor._has_acc:
print("miss Accumulator")
else:
print("Accumulator")

PyLint
PyLint is a widely adopted static code analysis tool for Python projects
Focus on code style and quality issues: to enforce coding standards, look
for code smells, and can make suggestions about where/how to refactor
Easy to install and integrated in many Python IDEs
Comes with numerous pre-existing checks, is highly configurable and
offers customizable linting rules for code quality checks.
Permits to define your own custom checks by writing an AST visitor.

Pylint accumulator pattern
import astroid
from astroid import nodes
from typing import TYPE_CHECKING, Optional
from pylint.checkers import BaseChecker
if TYPE_CHECKING:
from pylint.lint import PyLinter
class AccumulatorChecker(BaseChecker):
name = "no-acculumator"
msgs = {
"W0002": (
"Doesn't use an accumulator variable.",
"no-accumulator",
"All functions should use an accumulator variable.",
),
}
...
Using the astroid AST module

Pylint accumulator pattern
...
def __init__(self, linter: Optional["PyLinter"] = None) -> None:
super().__init__(linter)
self._function_stack = []
self._saved_assign = []
def visit_functiondef(self, node: nodes.FunctionDef) -> None:
print("visit_functiondef")
self._function_stack.append([])
self._saved_assign.append([])
def leave_functiondef(self, node: nodes.FunctionDef) -> None:
self._function_stack.pop()
self._saved_assign.pop()
def visit_assign(self, node: nodes.Assign) -> None:
for target in node.targets:
if isinstance(target, nodes.AssignName):
if isinstance(node.value, nodes.Const):
self._function_stack[-1].append(target.name)
...

...
def visit_augassign(self, node: nodes.AugAssign) -> None:
in_loop = False
for ancestor in node.node_ancestors():
if isinstance(ancestor, (nodes.For, nodes.While)):
in_loop = True
break
if isinstance(ancestor, nodes.FunctionDef):
break
if in_loop:
if node.target.name in self._function_stack[-1]:
self._saved_assign[-1].append(node.target.name)
def visit_return(self, node: nodes.Return) -> None:
if not isinstance(node.value, nodes.Name):
self.add_message("no-accumulator", node=node)
else:
if node.value.name not in self._saved_assign[-1]:
self.add_message("no-accumulator", node=node)
Pylint accumulator pattern

Logic Metaprogramming (LMP)
Use of a logic programming language (e.g., Prolog) as SPQL
Operating on a representation of the program's abstract syntax tree (AST)
reified as facts and rules in a logic knowledge base
To declare logic predicates that reason over the program structure
Making use of the full power of logic programming (resolution, unification,
recursion, …) to express patterns in a high-level, declarative manner

Logic Metaprogramming (LMP)
Represent all nodes in the program AST as logical facts in the following format :
node_def(Type, Children, ValueTypes)


node_val(Type, Node_Id, Parent_Id, Child_Ids, Values)
Example :
node_def(call,[args,func,keywords],[lineno,col_offset,end_col_offset,end_lineno])


node_val(call,12,11,[[13],15,[]],[2,0,21,2])
type of AST node : 

functiondef, assign, constant, for, call, …
unique ID for this

particular node
ID of this node’s

parent node

Logic Metaprogramming (LMP)
node_def(call,[args,func,keywords],[lineno,col_offset,end_col_offset,end_lineno])


node_val(call,12,11,[[13],15,[]],[2,0,21,2])

node_def node_val
node_def(module,[body,type_ignores],[])
node_def(assign,[value,targets,type_comment],
[lineno,col_offset,end_col_offset,end_lineno])
node_def(binop,[op,left,right],
[lineno,col_offset,end_col_offset,end_lineno])
node_def(add,[],[])
node_def(name,[ctx],
[id,lineno,col_offset,end_col_offset,end_lineno])
node_def(load,[],[])
node_def(constant,[kind],
[value,s,n,lineno,col_offset,end_col_offset,end_lineno])
node_def(null,[],[])
node_def(store,[],[])
node_def(expr,[value],
[lineno,col_offset,end_col_offset,end_lineno])
node_def(call,[args,func,keywords],
[lineno,col_offset,end_col_offset,end_lineno])
node_val(add,3,2,[],[])
node_val(load,5,4,[],[])
node_val(name,4,2,[5],[x,1,4,5,1])
node_val(null,7,6,[],[])
node_val(constant,6,2,[7],[1,1,1,1,8,9,1])
node_val(binop,2,1,[3,4,6],[1,4,9,1])
node_val(store,9,8,[],[])
node_val(name,8,1,[9],[x,1,0,1,1])
node_val(null,10,1,[],[])
node_val(assign,1,0,[2,[8],10],[1,0,9,1])
node_val(null,14,13,[],[])
node_val(constant,13,12,[14],[Hello World!,Hello
World!,Hello World!,2,6,20,2])
node_val(load,16,15,[],[])
node_val(name,15,12,[16],[print,2,0,5,2])
node_val(call,12,11,[[13],15,[]],[2,0,21,2])
node_val(expr,11,0,[12],[2,0,21,2])
node_val(module,0,-1,[[1,11],[]],[])
x = x + 1

print("Hello World”)

Logic Metaprogramming (LMP)
Additional predicates are predefined to avoid using this representation directly.
For example, id_type(Id,‘functiondef’) finds all Ids of function definition nodes
and info(Id, 'name', Name) yields the value of the name attribute of the node Id.
id_type(?Id, ?Type) Type is the AST node type of the AST node with identifier Id
info(+Id, +Name, -Info) gets Info for the attribute Name of the node with given Id
info(+Type, ?Id, +Name, -Info)
same as above but guarantees that the node is of given Type
children(-Children, +Id) Children is list of identifiers of the children of the node with given Id
parent(-Parent_Id, +Id) Parent_Id is the identifier of the parent of the node with identifier Id
descendant(?Child_Id, ?Parent_Id) node Child_ID is (direct of indirect) descendant of node Parent_ID

Logic Metaprogramming (LMP)
In terms of which yet more readable predicates can de defined.
ast_functiondef(Id) :- id_type(Id, ‘functiondef’).

LMP example
Predicate to find Name of each function that contains a list of at least three hard-coded
constants inside.
contains_hardcoded list(-Name) :-

id_type(List_Id, 'list'),

info(List_Id, 'elts', Children_List),

findall( Child,

(member(Child, Children_List), id_type(Child,'constant')),

Constant_Children),

length(Constant_Children, N_Constant), N_Constant >= 3,

descendant(List_Id, _, Function_Id, 'functiondef'),

info(Function_Id, 'name', Name).
This is a (bad) pattern that some students try to use to
“trick” an autograder by hardcoding rather than
calculating expected solutions to an exercise.

Pyttern
A program query language that is easy to learn, read and write for non-expert
developers or software engineers
A simple query language for Python programs with a syntax that remains close
to Python
essentially Python syntax extended with wildcards (representing “holes” in
the program)
inspired by regex and SCRUPLE
yet allows for the easy declaration and detection of sufficiently complex coding
idioms

Pyttern Wildcards
Wildcard Meaning
? Matches a single node in the AST
?* Matches zero or more nodes in an AST list node
?{name} Matches a single node and names it for later reference
?: Matches the body of this wildcard at one nesting level
?*: Matches the body of this wildcard at any nesting level

Pyttern Python
def foo(?*):
?*
?*:
?*
for ? in range(?*):
?*
if ? == 0:
?*
?*
?*
?*
def foo(lst):
counter = 0
for i in range(len(lst)):
if lst[i] == 0:
counter += 1
return counter

Pyttern Python
function_def
[list]
body
any_wildcard
0
any_compound_wildcard
1
any_wildcard
2
[list]
body
any_wildcard
0
for
1
any_wildcard
2
simple_wildcard
target
call
iter
[list]
body
any_wildcard
0
if
1
any_wildcard
2
body
0
1
targetiterbody
id 0
function_def
[list]
assign
for
name call
[list]
i if
def foo(?*):
?*
?*:
?*
for ? in range(?*):
?*
if ? == 0:
?*
?*
?*
?*
def foo(lst):
counter = 0
for i in range(len(lst)):
if lst[i] == 0:
counter += 1
return counter
value
id
return
name
counter
2
... ...
a) Simplified version of a pattern AST b) Simplified version of a python AST
foo
name
[list]
any_wildcard
args
0
foo [list]
argsname
0
lst
... ...

Pyttern Python



def ?(?*):
?accumulator = 0
for ?:
?accumulator += ?
return ?accumulator



def accumulate(lst):
accum = 0
for item in lst :
accum += item
return accum

Pyttern accumulator pattern


def ?(?*):
?accumulator = 0
for ?:
?accumulator += ?
return ?accumulator



def accumulate(lst):
accum = 0
for item in lst :
accum += item
return accum
Pyttern Python

Limitations of (some) PQLs
False positives:

reporting a problem that isn't one
False negatives:

failing to detect actual problems
Complexity/usability:

limited adoption due to required expertise

and steep learning curve
Depth of analysis:

limited expresiveness to declare

more than just superficial patterns
Performance:

too computationally intensive; no immediate

feedback when analysing large codebases

Description of the Experiment
Numerous program querying tools and languages exist that can be used
for detecting structural patterns in program source code
Some of them are more complex to use than others or have a steeper
learning curve
Some of them are more expressive (in the kind of patterns they can
detect) than others
We need your help to compare the relative expressiveness and usability
from different points of view of various existing program query languages