TMPA-2017: Tools and Methods of Program Analysis
3-4 March, 2017, Hotel Holiday Inn Moscow Vinogradovo, Moscow
Vellvm - Verifying the LLVM
Steve Zdancewic (Professor, USA University of Pennsylvania)
For video follow the link: https://youtu.be/jDPAtUfnoBU
Would like to know more?
Visit our website:...
TMPA-2017: Tools and Methods of Program Analysis
3-4 March, 2017, Hotel Holiday Inn Moscow Vinogradovo, Moscow
Vellvm - Verifying the LLVM
Steve Zdancewic (Professor, USA University of Pennsylvania)
For video follow the link: https://youtu.be/jDPAtUfnoBU
Would like to know more?
Visit our website:
www.tmpaconf.org
www.exactprosystems.com/events/tmpa
Vellvm: Verifying the LLVM IR
Steve Zdancewic
University of Pennsylvania
TMPA 2017
Collaborators
• Jianzhou Zhao
• Dmitri Garbuzov
• William Mansky
• ChrisGne Rizkallah
• Richard Zhang
• Milo M.K. MarGn
• Santosh NagarakaLe
• Gil Hur
• Jeehon Kang
• Viktor Vafeiadis
The Science of Deep SpecificaGon
• Andrew Appel (Princeton)
• Adam Chlipala (MIT)
• Zhong Shao (Yale)
• Benjamin Pierce (U. Penn.)
• Stephanie Weirich (U. Penn.)
The Need For High Assurance So[ware
4
heartbleed
car hacking
stuxnet buffer overflow aLacks
Deep SpecificaGons
• Rich – expressive descripGon
• Formal – mathemaGcal, machine-checked
• 2-Sided – tested from both sides
• Live – connected to real, executable code
Goal: Advance the reliability, safety, security, and
cost-effecGveness of so[ware (and hardware).
The Coq InteracGve Theorem Prover
• Based on dependent type theory
• Pure funcGonal language + datatypes
• ConstrucGve proofs ⇒ executable code
• AutomaGon: tacGcs + inference
⇒ formalizaGon tool of choice for DeepSpec team
[Developed at INRIA]
DeepSpec: InterconnecGons
DeepSpec: InterconnecGons
LLVM: Low-Level Virtual Machine
LLVM Compiler Infrastructure
LLVM
Front
Ends
Code
Gen/Jit
OpGmizaGons/
TransformaGons
Typed SSA
IR
Analysis
[LaLner et al. ]
MoGvaGon: So[Bound/CETS
• Buffer overflow vulnerabiliGes.
• Detect spaGal/temporal memory
safety violaGons in legacy C code.
• Implemented as an LLVM pass.
• What about correctness?
[NagarakaLe, et al. PLDI ’09, ISMM ‘10]
hLp://www.cis.upenn.edu/acg/so[bound/
InspiraGon: CompCert
12
[Xavier Leroy INRIA Rocquencourt]
OpGmizing C Compiler:
proved correct end-to-end
with machine-checked proof in Coq
C language
CompCert
Compiler
ISA
rich, formal,
2-sided, live
Does Such VerificaGon Work?
LLVM
Csmith Tool
Random
test-case generaGon
{8 other C compilers}
+
CompCert
79 bugs
(25 criGcal)
202 bugs
325 bugs in
total
Source
Programs
[Yang et al. PLDI 2011]
YES! VerificaGon Works
"The striking thing about our CompCert results is that the
middle-end bugs we found in all other compilers are absent.
As of early 2011, the under-development version of CompCert
is the only compiler we have tested for which Csmith cannot
find wrong-code errors. This is not for lack of trying: we have
devoted about six CPU-years to the task. The apparent
unbreakability of CompCert supports a strong argument that
developing compiler opEmizaEons within a proof framework,
where safety checks are explicit and machine-checked, has
tangible benefits for compiler users."
– Regehr et. al 2011
LLVM Compiler Infrastructure
LLVM
Front
Ends
Code
Gen/Jit
OpGmizaGons/
TransformaGons
Typed SSA
IR
Analysis
[LaLner et al. ]
LLVM Compiler Infrastructure
LLVM
Front
Ends
Code
Gen/Jit
OpGmizaGons/
TransformaGons
Typed SSA
IR
Analysis
[LaLner et al.]
The Vellvm Project
OpGmizaGons/
TransformaGons
Typed SSA
IR
Analysis
• Formal semanGcs
• FaciliGes for creaGng
simulaGon proofs
• Implemented in Coq
• Extract passes for use
with LLVM compiler
• Example: verified
memory safety
instrumentaGon
[Zhao et al. POPL 2012, CPP 2012, PLDI 2013]
Vellvm Framework
Transform
C Source
Code
Other
OpGmizaGons
LLVM
IR
LLVM
IR
Target
LLVM
OCaml Bindings
Printer Parser
Coq
Syntax
OperaGonal
SemanGcs
Memory
Model
Type System
and SSA
Proof Techniques & Metatheory
Extract
Vellvm Framework
C Source
Code
Other
OpGmizaGons
LLVM
IR
LLVM
IR
Target
LLVM
OCaml Bindings
Printer Parser
Coq
Syntax
OperaGonal
SemanGcs
Memory
Model
Type System
and SSA
Proof Techniques & Metatheory
Extract
Verified
Transform
Plan
• Tour of the LLVM IR
• Vellvm infrastructure
– OperaGonal SemanGcs
– SSA Metatheory + Proof Techniques
• Case studies:
– So[Bound memory safety
– mem2reg
• Conclusion
LLVM IR by Example
entry:
r
0
= ...
r
1
= ...
r
2
= ...
Control-flow Graphs:
+ Labeled blocks
exit:
r
7
= ...
r
8
= r
1
x r
2
r
9
= r
7
+ r
8
loop:
r
3
= ...
r
4
= r
1
x r
2
r
5
= r
3
+ r
4
r
6
= r
5
≥ 100
exit:
r
7
= φ[0;entry][r
5
;loop]
r
8
= r
1
x r
2
r
9
= r
7
+ r
8
ret r
9
loop:
r
3
= φ[0;entry][r
5
;loop]
r
4
= r
1
x r
2
r
5
= r
3
+ r
4
r
6
= r
5
≥ 100
br r
6
loop exit
exit:
r
7
= φ[0;entry][r
5
;loop]
r
8
= r
1
x r
2
r
9
= r
7
+ r
8
ret r
9
loop:
r
3
= φ[0;entry][r
5
;loop]
r
4
= r
1
x r
2
r
5
= r
3
+ r
4
r
6
= r
5
≥ 100
br r
6
loop exit
Other Parts of the LLVM IR
28
op ::= %uid | constant | undef Operands
bop ::= add | sub | mul | shl | … OperaEons
cmpop::= eq | ne | slt | sle | … Comparison
insn ::=
|%uid = alloca ty Stack AllocaEon
|%uid = load ty op1 Load
|store ty op1, op2 Store
|%uid = getelementptr ty op1 … Address CalculaEon
|%uid = call rt fun(…args…) FuncEon Calls
|…
phi ::=
|φ[op1;lbl1]...[opn;lbln]
terminator ::=
|ret %ty op
| br op label %lbl1, label %lbl2
|br label %lbl
Plan
• Tour of the LLVM IR
• Vellvm infrastructure
– OperaGonal SemanGcs
– SSA Metatheory + Proof Techniques
• Case studies:
– So[Bound memory safety
– mem2reg
• Conclusion
LLVM IR SemanGcs
SSA CFG ≈ funcGonal program
+
• Types & Memory Layout
– structured, recursive types
– type-directed projecGon
– type casts
• Effects
– structured heap load/store
– system calls (I/O)
– nondeterminism
[Appel 1998]
We know how
to model this
and prove
properGes about
the models.
LLVM’s memory model
• Manipulate structured types.
%ST = type {i10,[10 x i8*]}
i10
i8*
i8*
i8*
i8*
i8*
i8*
i8*
i8*
i8*
i8*
High-level
RepresentaGon
%val = load %ST* %ptr
…
store %ST* %ptr, %new
Target-dependent Results
Sources of Undefined Behavior
• UniniGalized variables:
• UniniGalized memory:
• Ill-typed memory usage
%v = add i32 %x, undef
%ptr = alloca i32
%v = load (i32*) %ptr
Nondeterminism Stuck States
Stuck(f, σ) = BadFree(f, σ)
˅ BadLoad(f, σ)
˅ BadStore(f, σ)
˅ …
˅ …0
Defined by a predicate on
the program configuraGon.
undef
• What is the value of %y a[er running the following?
• One plausible answer: 0
• Not LLVM’s semanGcs!
(LLVM is more liberal to permit more aggressive opGmizaGons)
%x = or i8 undef, 1
%y = xor i8 %x %x
undef
• ParGally defined values are interpreted
nondeterminisEcally as sets of possible values:
⟦%x⟧= {a or b | a∈⟦i8 undef⟧, b ∈⟦1⟧}
= {1,3,5,…,255}
⟦%y⟧ = {a xor b | a∈⟦%x⟧, b∈⟦%x⟧}
= {0,2,4,…,254}
%x = or i8 undef, 1
%y = xor i8 %x %x
⟦i8 undef⟧ = {0,…,255}
⟦i8 1⟧ = {1}
LLVM
ND
OperaGonal SemanGcs
• Define a transiGon relaGon:
f ⊢ σ
1
⟼ σ
2
– f is the program
– σ is the program state: pc, locals(δ), stack, heap
• NondeterminisGc
– δ maps local %uids to sets.
– Step relaGon is nondeterminisGc
• Mostly straigh~orward (given the heap model)
– One wrinkle: phi-nodes exectuted atomically
OperaGonal SemanGcs
Small Step Big Step
NondeterminisGc
DeterminisGc
LLVM
ND
DeterminisGc Refinement
Small Step Big Step
NondeterminisGc
DeterminisGc
LLVM
ND
LLVM
D
∋︎
InstanGate ‘undef’ with default value (0 or null) ⇒ determinisGc.
Big-step DeterminisGc Refinements
Small Step Big Step
NondeterminisGc
DeterminisGc
LLVM
ND
LLVM
D
LLVM
Interp ≈︎
∋︎
BisimulaGon up to “observable events”:
• external funcGon calls
Big-step DeterminisGc Refinements
[Tristan, et al. POPL ’08, Tristan, et al. PLDI ’09]
Small Step Big Step
NondeterminisGc
DeterminisGc
LLVM
ND
LLVM
D
LLVM
*
DFn
LLVM
*
DB
LLVM
Interp ≈︎≿︎ ≿︎
∋︎
SimulaGon up to “observable events”:
• useful for encapsulaGng behavior of funcGon calls
• large step evaluaGon of basic blocks
A Taste of Coq FormalizaGon
…
Plan
• Tour of the LLVM IR
• Vellvm infrastructure
– OperaGonal SemanGcs
– SSA Metatheory + Proof Techniques
• Case studies:
– So[Bound memory safety
– mem2reg
• Conclusion
Reasoning About LLVM Code
How do we prove that a program
transformaGon is correct with respect to the
defined operaGonal semanGcs?
• Safety Invariants (preservaGon and progress)
• SimulaGon techniques
Key SSA Invariant
entry:
r
0
= ...
r
1
= ...
r
2
= ...
br r
0
loop exit
exit:
r
7
= φ[0;entry][r
5
;loop]
r
8
= r
1
x r
2
r
9
= r
7
+ r
8
ret r
9
loop:
r
3
= φ[0;entry][r
5
;loop]
r
4
= r
1
x r
2
r
5
= r
3
+ r
4
r
6
= r
5
≥ 100
br r
6
loop exit
DefiniGon of r
2
.
Use of r
2
. Uses of r
2
.
Key SSA Invariant
entry:
r
0
= ...
r
1
= ...
r
2
= ...
br r
0
loop exit
exit:
r
7
= φ[0;entry][r
5
;loop]
r
8
= r
1
x r
2
r
9
= r
7
+ r
8
ret r
9
loop:
r
3
= φ[0;entry][r
5
;loop]
r
4
= r
1
x r
2
r
5
= r
3
+ r
4
r
6
= r
5
≥ 100
br r
6
loop exit
DefiniGon of r
2
.
Use of r
2
. Uses of r
2
.
The definiGon of a
variable must dominate
its uses.
Safety ProperGes
• A well-formed program never accesses undefined variables.
• Ini=aliza=on:
• Preserva=on:
• Progress:
If ⊢ f and f ⊢ σ
0
⟼* σ then σ is not stuck.
⊢ f program f is well formed
σ program state
f ⊢ σ ⟼* σ evaluaGon of f
If ⊢ f then wf(f, σ
0
).
If ⊢ f and f ⊢ σ ⟼ σ’ and wf(f, σ) then wf(f, σ’)
If ⊢ f and wf(f, σ) then f ⊢ σ ⟼ σ’
Safety ProperGes
• A well-formed program never accesses undefined variables.
• Ini=aliza=on:
• Preserva=on:
• Progress:
If ⊢ f and f ⊢ σ
0
⟼* σ then σ is not stuck.
⊢ f program f is well formed
σ program state
f ⊢ σ ⟼* σ evaluaGon of f
If ⊢ f then wf(f, σ
0
).
If ⊢ f and f ⊢ σ ⟼ σ’ and wf(f, σ) then wf(f, σ’)
If ⊢ f and wf(f, σ) then done(f,σ) or stuck(f,σ) or f ⊢ σ ⟼ σ’
Well-formed States
entry:
r
0
= ...
r
1
= ...
r
2
= ...
br r
0
loop exit
exit:
r
7
= φ[0;entry][r
5
;loop]
r
8
= r
1
x r
2
r
9
= r
7
+ r
8
ret r
9
loop:
r
3
= φ[0;entry][r
5
;loop]
r
4
= r
1
x r
2
r
5
= r
3
+ r
4
r
6
= r
5
≥ 100
br r
6
loop exit
pc
State σ is:
pc = program counter
δ = local values
Well-formed States
entry:
r
0
= ...
r
1
= ...
r
2
= ...
br r
0
loop exit
exit:
r
7
= φ[0;entry][r
5
;loop]
r
8
= r
1
x r
2
r
9
= r
7
+ r
8
ret r
9
loop:
r
3
= φ[0;entry][r
5
;loop]
r
4
= r
1
x r
2
r
5
= r
3
+ r
4
r
6
= r
5
≥ 100
br r
6
loop exit
pc
State σ is:
pc = program counter
δ = local values
sdom(f,pc) = variable defns.
that strictly dominate pc.
Well-formed States
entry:
r
0
= ...
r
1
= ...
r
2
= ...
br r
0
loop exit
exit:
r
7
= φ[0;entry][r
5
;loop]
r
8
= r
1
x r
2
r
9
= r
7
+ r
8
ret r
9
loop:
r
3
= φ[0;entry][r
5
;loop]
r
4
= r
1
x r
2
r
5
= r
3
+ r
4
r
6
= r
5
≥ 100
br r
6
loop exit
pc
State σ contains:
pc = program counter
δ = local values
sdom(f,pc) = variable defns.
that strictly dominate pc.
wf(f,σ) =
∀r∊sdom(f,pc). ∃v. δ(r) = ⎣v⎦
“All variables in scope
are iniGalized.”
Generalizing Safety
• DefiniGon of wf:
• Generalize like this:
• Methodology: for a given P prove three theorems:
IniEalizaEon(P)
PreservaEon(P)
Progress(P)
wf(f,(pc, δ)) = ∀r∊sdom(f,pc). ∃v. δ(r) = ⎣v⎦
wf(f,(pc, δ)) = P f (δ|
sdom(f,pc)
)
where P : Program ⟶ Locals ⟶ Prop
Consider only variables in
scope ⇒ P defined
relaGve to the dominator
tree of the CFG.
InstanGaGng
• For usual safety:
• For semanGc properGes:
• Useful for verifying correctness of:
– code moGon, dead variable eliminaGon, common
expression eliminaGon, etc.
P
safety
f δ = ∀r∊dom(δ). ∃v. δ(r) = ⎣v⎦
P
sem
f δ = ∀r. f[r] = ⎣rhs⎦ ⇒ δ(r) = ⟦rhs⟧
δ
Plan
• Tour of the LLVM IR
• Vellvm infrastructure
– OperaGonal SemanGcs
– SSA Metatheory + Proof Techniques
• Case studies:
– So[Bound memory safety
– mem2reg
• Conclusion
So[Bound
So[Bound
C Source
Code
Other
OpGmizaGons
LLVM
IR
LLVM
IR
Target
• Implemented as an LLVM pass.
• Detect spaGal/temporal memory
safety violaGons in legacy C code.
• Good test case:
– Safety CriGcal ⇒ Proof cost warranted
– Non-trivial Memory transformaGon
So[Bound
So[Bound
C Source
Code
Other
OpGmizaGons
LLVM
IR
LLVM
IR
Target
%p = call malloc [10 x i8]
%q = gep %p, i32 0, i32 255
store i8 0, %q
%p = call malloc [10 x i8]
%p_base = gep %p, i32 0
%p_bound = gep %p, i32 0, i32 10
%q = gep %p, i32 0, i32 255
%q_base = %p_base
%q_bound = %p_bound
assert %q_base <= %q
/ %q+1 < % q_bound
store i8 0, %q
Maintain base and bound for all pointers
Propagate metadata on assignment
Check that a pointer is within its
bounds when being accessed
Disjoint Metadata
• Maintain pointer bounds in a separate memory space.
• Key Invariant: Metadata cannot be corrupted by bounds
violaGon.
User memory
Disjoint metadata
%p %p
base
%p
bound
%i
1
%q %q
base
%q
bound
%i
6
%i
3
Proving So[Bound Correct
1. Define So[Bound(f,σ) = (f
s
,σ
s
)
– TransformaGon pass implemented in Coq.
2. Define predicate: MemoryViolaGon(f,σ)
3. Construct a non-standard operaGonal semanGcs:
– Builds in safety invariants “by construcGon”
4. Show that the instrumented code simulates the “correct”
code:
SB
f ⊢ σ ⟼ σ’
SB
f ⊢ σ ⟼* σ’ ⇒ ¬MemoryViolaGon(f,σ’)
So[Bound(f,σ) = (f
s
,σ
s
) ⇒ [f ⊢ σ ⟼* σ’] ≿ [f
s
⊢ σ
s
⟼* σ’
s
]
SB
Lessons About So[Bound
• Found several bugs in our C++ implementaGon
– InteracGon of undef, ‘null’, and metadata
iniGalizaGon.
• SimulaGon proofs suggested a redesign of
So[Bound’s handling of stack pointers.
– Use a “shadow stack”
– Simplify the design/implementaGon
– Significantly more robust (e.g. varargs)
0%
50%
100%
150%
200%
250%
Run5me overhead
Extracted
Competitive Runtime Overhead
The performance of extracted SoftBound is competitive
with the non-verified original
Plan
• Tour of the LLVM IR
• Vellvm infrastructure
– OperaGonal SemanGcs
– SSA Metatheory + Proof Techniques
• Case studies:
– So[Bound memory safety
– mem2reg
• Conclusion
mem2reg in LLVM
Front-ends
w/o SSA
construcGon
The LLVM IR
w/o φ-nodes
mem2reg
• Promote stack allocas to temporaries
• Insert minimal φ-nodes
• imperaGve variables <z stack allocas
• no φ-nodes
• trivially in SSA form
Backends
SSA-based
opGmizaGons
The LLVM IR in the
minimal SSA form
mem2reg Example
int x = 0;
if (y > 0)
x = 1;
return x;
l
1
: %p = alloca i32
store 0, %p
%b = %y > 0
br %b, %l
2
, %l
3
l
2
:
store 1, %p
br %l
3
l
3
:
%x = load %p
ret %x
The LLVM IR in the trivial SSA form
mem2reg Example
int x = 0;
if (y > 0)
x = 1;
return x;
l
1
: %p = alloca i32
store 0, %p
%b = %y > 0
br %b, %l
2
, %l
3
l
2
:
store 1, %p
br %l
3
l
3
:
%x = load %p
ret %x
The LLVM IR in the trivial SSA form
l
1
:
%b = %y > 0
br %b, %l
2
, %l
3
l
2
:
br %l
3
l
3
:
%x = φ[ 1,%l
2
] [ 0,%l
1
]
ret %x
Minimal SSA a[er mem2reg
mem2reg
mem2reg Algorithm
• Main operaGons
– Phi placement (Lengauer-Tarjan algorithm)
– Renaming of the variables
– Removing loads/stores
• Intermediate stage breaks SSA invariant
– Defining semanGcs & well formedness non-trivial
How to Establish Correctness?
max φs
LAS/
LAA
DSE
DAE
elim φ
Find
alloca
1. Simple aliasing properGes
(e.g. to determine promotability)
2. InstanGate proof technique for
– SubsGtuGon
– Dead InstrucGon EliminaGon
P
DIE
= …
IniGalize(P
DIE
)
PreservaGon(P
DIE
)
Progress(P
DIE
)
4. Put it all together to prove
composiGon of “pipeline”
correct.
Aliasing
ProperGes
subst
DIE
vmem2reg is Correct
Theorem: The vmem2reg algorithm
preserves the semanGcs of the source
program.
Proof:
ComposiGon of simulaGon relaGons from the “mini”
transformaGons, each built using instances of the sdom
proof technique.
(See Coq Vellvm development.) □
RunGme overhead of verified mem2reg
0%
20%
40%
60%
80%
100%
120%
140%
160%
180%
200%
sjeng
go
compress
ijpeg gzip vpr
mesa
art
ammp equake
libquantum
lbm milc
bzip2 parser twolf
mcf h264
Geo.mean
Speedup Over LLVM-O0
LLVM's mem2reg Extracted mem2reg
Vmem2reg: 77% LLVM’s mem2reg: 81%
(LLVM’s mem2reg promotes allocas used by intrinsics)
Plan
• Tour of the LLVM IR
• Vellvm infrastructure
– OperaGonal SemanGcs
– SSA Metatheory + Proof Techniques
• Case studies:
– So[Bound memory safety
– mem2reg
• Conclusion
Ongoing Work
• Modular SemanGcs
– Factor out memory model [CAV 15]
– Linking/separate compilaGon
• For:
– more extensibility/robustness to changes
– verifying more analyses and opGmizaGons / program transformaGons
– support for (relaxed) concurrency
– beLer support for casts [PLDI 15]
LLVM SSA
core IR
Memory
Model / IO
Concurrency
• Deep SpecificaGons
– rich, formal, 2-sided, live
• Layers of abstracGon
– Layer Calculus in CerGKOS [Shao et al.]
– Good for proofs!
– Bad for performance?
– ImplicaGons for theory / proof engineering?
• ComposiGonal specificaGon
– ComposiGonal CompCert [Stewart, et al. PLDI 15]
• So[ware Engineering ⇒ Proof Engineering
– Coq development methodology [CPDT: Chlipala]
What engineering principles enable
large-scale deep specifications?
Conclusions
• Proof techniques for verifying LLVM transformaGons
• Verified:
– So[bound & vmem2reg
– Similar performance to naGve implementaGons
• Future:
– IntegraGon with other DeepSpec projects
[hLp://www.cis.upenn.edu/~stevez/vellvm/]