Predicting Test Results without Execution (FSE 2024)

andrehoraa · 26 slides · Jul 17, 2024

About This Presentation

As software systems grow, test suites may become complex, making it challenging to run the tests frequently and locally. Recently, Large Language Models (LLMs) have been adopted in multiple software engineering tasks. They have demonstrated great results in code generation; however, it is not yet clear whether they understand code execution well enough to predict test results.


Slide Content

Predicting Test Results without Execution
Andre Hora
DCC/UFMG
[email protected]
FSE 2024
Ideas, Visions and Reflections

Motivation & Problem
Software testing is a key practice in modern software development.
Developers rely on tests for multiple reasons: to avoid regressions, provide fast feedback, ensure sustainable software evolution, etc.

Over time, as software systems grow, test suites may become complex, making it challenging to run the tests frequently (and locally).
CPython testing documentation: "There could be platform-specific code that simply will not execute for you, errors in the output, etc."
Ray testing documentation: "The full suite of tests is too large to run on a single machine."
pip testing documentation: "Running pip's entire test suite requires supported version control tools to be installed."

It would be valuable to predict test results without actually executing the test suites, bypassing any challenges that may arise during a test run.

Large Language Models for Software Engineering
Large Language Models (LLMs) have been adopted in multiple software engineering tasks [4, 6, 11, 13, 16], mainly related to code generation.

However, it is not clear whether LLMs understand code execution:
● A recent study by Microsoft evaluated the capability of LLMs to understand code execution by exploring code coverage prediction tasks
● GPT-4 achieved the highest performance: ~24% in the best scenario tested

So far, it is unclear whether LLMs can be used to predict test results and, potentially, overcome the issues of running real-world tests.

Proposed Work
To shed some light on this problem, we explore the capability of LLMs to predict test results without execution.
We evaluate the performance of GPT-4 in predicting the execution of 200 test cases from the Python Standard Library.


Study Design
1. Selecting Test Cases
2. Creating Prompts and Assessing Answers
3. Evaluation: Precision, Recall, and Accuracy
4. Research Questions:
   a. RQ1: All test cases
   b. RQ2: Test case complexity
   c. RQ3: Test suite

Study Design: Selecting Test Cases
1. Five Python Standard Library modules (ast, calendar, csv, gzip, and string)
2. Two tests per module (total of 10 unique tests)
3. For each test, 20 manually modified tests (10 passing and 10 failing)
4. Total of 200 tests (100 passing + 100 failing)

[Figure: an original test, a passing version with valid input "foo{0}{0}-{1}" and valid output "foobarbar-6", and a failing version with invalid output " foo ", i.e., with extra blank spaces]

Study Design: Creating Prompts and Assessing Answers
1. GPT-4: the model with the best results in code coverage prediction [16]
2. Create a prompt for each test case and submit it to GPT-4
3. Read the prompt answers to assess the test result prediction
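The slides do not reproduce the exact prompt wording used in the study, so the template below is only a hypothetical illustration of the task setup (one prompt per test case, asking for a pass/fail verdict):

```python
# Hypothetical prompt template -- an illustration of the task setup,
# not the wording used in the paper.
PROMPT_TEMPLATE = """You are given a Python test case.
Without executing it, predict whether the test passes or fails.
Answer with "pass" or "fail" and briefly justify your verdict.

Test case:
{test_code}
"""

def build_prompt(test_code: str) -> str:
    """Fill the template with one test case's source code."""
    return PROMPT_TEMPLATE.format(test_code=test_code)
```

Each generated prompt would then be submitted to the model, and the answer read to extract the predicted verdict.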

Study Design: Evaluation (Precision, Recall, and Accuracy)
We evaluate the performance of the test result prediction task by computing precision, recall, and accuracy:
● True Positive (TP): correctly predicted failing test case
● False Positive (FP): incorrectly predicted failing test case (wrong alert)
● True Negative (TN): correctly predicted passing test case
● False Negative (FN): incorrectly predicted passing test case (missing alert)
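These definitions translate directly into code. The confusion-matrix counts below are illustrative assumptions chosen to be consistent with the RQ1 numbers reported later (88.8% precision and 71% recall over 100 failing and 100 passing tests); they are not taken from the paper:

```python
def precision(tp: int, fp: int) -> float:
    # Of all tests predicted as failing, how many actually fail?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of all actually failing tests, how many were predicted as failing?
    return tp / (tp + fn)

def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    # Fraction of correct predictions overall.
    return (tp + tn) / (tp + fp + tn + fn)

# Illustrative counts (assumption): 71 of 100 failing tests caught,
# 9 wrong alerts among the 100 passing tests.
tp, fp, tn, fn = 71, 9, 91, 29
p = precision(tp, fp)          # 71/80  ~ 0.888 (88.8%)
r = recall(tp, fn)             # 71/100 = 0.71  (71%)
a = accuracy(tp, fp, tn, fn)   # 162/200 = 0.81 (81%)
```

Note the asymmetry the paper highlights: a false negative (a failing test predicted as passing) hurts recall and silently hides a defect, which is why missing alerts are more problematic than wrong alerts.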

Study Design: Research Questions
What is the performance of GPT-4 in predicting test results?
● RQ1: All test cases (200 tests)
● RQ2: Test case complexity (100 simple vs. 100 complex)
● RQ3: Test suite (ast, calendar, csv, gzip, and string)

Results

What is the performance of GPT-4 in predicting test results?
RQ1: GPT-4 has a precision of 88.8% and a recall of 71% in test result prediction. FNs (missing alerts) are more problematic than FPs (wrong alerts).
RQ2: GPT-4 presented better precision and recall when predicting simpler tests than complex ones. Results are still far from 100%, even for simple tests.
RQ3: GPT-4 presented differences among the analyzed test suites, with precision ranging from 77.8% to 94.7% and recall between 60% and 90%.

Discussion and Observations
● Correct analysis but incorrect conclusions
  ○ Correct explanation for a passing or failing test; however, the final verdict was incorrect
● Reliance on comments rather than on test code
  ○ Comments may be wrong or outdated
● Explanations based on "general knowledge" (rather than on code)
  ○ In some cases, GPT-4 provided explanations based on "general knowledge" to complement the rationales, instead of relying solely on the source code

Summary
We evaluated the performance of GPT-4 in predicting the execution of 200 test cases from the Python Standard Library.
RQ1: GPT-4 presented a precision of 88.8% and a recall of 71% in test result prediction. FNs (missing alerts) are more problematic than FPs (wrong alerts).
RQ2: GPT-4 presented better precision and recall when predicting simpler tests than complex ones. However, results are still far from 100%, even for simple tests.
RQ3: GPT-4 presented large differences in precision and recall among the analyzed test suites.
