Predicting Test Results without Execution (FSE 2024)
About This Presentation
As software systems grow, test suites may become complex, making it challenging to run the tests frequently and locally. Recently, Large Language Models (LLMs) have been adopted in multiple software engineering tasks. They have demonstrated great results in code generation; however, it is not yet clear whether these models understand code execution. Particularly, it is unclear whether LLMs can be used to predict test results and, potentially, overcome the issues of running real-world tests. To shed some light on this problem, in this paper, we explore the capability of LLMs to predict test results without execution. We evaluate the performance of the state-of-the-art GPT-4 in predicting the execution of 200 test cases of the Python Standard Library. Among these 200 test cases, 100 are passing and 100 are failing. Overall, we find that GPT-4 has a precision of 88.8%, a recall of 71%, and an accuracy of 81% in the test result prediction. However, the results vary depending on the test complexity: GPT-4 presented better precision and recall when predicting simpler tests (93.2% and 82%) than complex ones (83.3% and 60%). We also find differences among the analyzed test suites, with the precision ranging from 77.8% to 94.7% and the recall between 60% and 90%. Our findings suggest that GPT-4 still needs significant progress in predicting test results.
Slide Content
Predicting Test Results
without Execution
Andre Hora
DCC/UFMG [email protected]
FSE 2024
Ideas, Visions and Reflections
Motivation & Problem
Software testing is a key practice in modern software development
Developers rely on tests for multiple reasons: avoid regressions, provide fast feedback, ensure sustainable software evolution, etc.
Over time, as software systems grow, test suites may become complex, making it challenging to run the tests frequently (and locally)
CPython testing documentation: "There could be platform-specific code that simply will not execute for you, errors in the output, etc."
Ray testing documentation: "The full suite of tests is too large to run on a single machine."
pip testing documentation: "Running pip's entire test suite requires supported version control tools to be installed."
It would be valuable to predict test results without actually executing the test suites, bypassing any challenges that may arise during test runs
Large Language Models for Software Engineering
Large Language Models (LLMs) have been adopted in multiple software engineering tasks [4, 6, 11, 13, 16], mainly related to code generation
However, it is not clear whether LLMs understand code execution
● A recent study performed by Microsoft evaluated the capability of LLMs in understanding code execution by exploring code coverage prediction tasks
● GPT-4 achieved the highest performance: ~24% in the best-tested scenario
So far, it is unclear whether LLMs can be used to predict test results and, potentially, overcome the issues of running real-world tests
Proposed Work
To shed some light on this problem, we explore the capability of LLMs to predict test results without execution
We evaluate the performance of GPT-4 in predicting the execution of 200 test cases of the Python Standard Library
Study Design
1. Selecting Test Cases
2. Creating Prompts and Assessing Answers
3. Evaluation: Precision, Recall, and Accuracy
4. Research Questions:
   a. RQ1: All test cases
   b. RQ2: Test case complexity
   c. RQ3: Test suite
Study Design: Selecting Test Cases
1. Five Python Standard Library modules (ast, calendar, csv, gzip, and string)
2. Two tests per module (total of 10 unique tests)
3. For each test, 20 manually modified tests (10 passing and 10 failing tests)
4. Total of 200 tests (100 passing tests + 100 failing tests)
Example (string test suite): an original test, a passing test version (valid input "foo{0}{0}-{1}" and valid output "foobarbar-6"), and a failing test version (invalid output " foo ", i.e., with extra blank spaces)
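The slides show these test versions only as code screenshots. As a minimal sketch of what such passing and failing variants could look like, assuming unittest-style tests over str.format (the class and method names below are illustrative, not the paper's):

import unittest

class FormatPassingVersion(unittest.TestCase):
    def test_format(self):
        # Passing version: valid input and valid expected output.
        self.assertEqual("foo{0}{0}-{1}".format("bar", 6), "foobarbar-6")

class FormatFailingVersion(unittest.TestCase):
    def test_format(self):
        # Failing version: the expected output " foo " has extra blank
        # spaces, so the assertion fails when the test is executed.
        self.assertEqual("{0}".format("foo"), " foo ")

if __name__ == "__main__":
    unittest.main()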
Study Design: Creating Prompts and Assessing Answers
1. GPT-4: the model with the best results in code coverage prediction [16]
2. Create a prompt for each test case and submit it to GPT-4 (sketched below)
3. Read the prompt answers to assess the test result prediction
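The exact prompt wording is not reproduced in the slides, so the snippet below is only an illustrative sketch of the submission step, assuming the OpenAI Python client; the prompt template and function name are hypothetical:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt template; the paper's actual wording may differ.
PROMPT_TEMPLATE = (
    "Consider the following Python test case from the Python Standard Library. "
    "Without executing it, predict whether it passes or fails, and explain why.\n\n"
    "{test_code}"
)

def predict_test_result(test_code: str) -> str:
    # Submit one test case to GPT-4 and return its textual answer,
    # which is then read manually to assess the predicted result.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(test_code=test_code)}],
    )
    return response.choices[0].message.content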
Study Design: Evaluation: Precision, Recall, and Accuracy
We evaluate the performance of the test result prediction task by computing precision, recall, and accuracy
● True Positive (TP): correctly predicted failing test case
● False Positive (FP): incorrectly predicted failing test case (wrong alert)
● True Negative (TN): correctly predicted passing test case
● False Negative (FN): incorrectly predicted passing test case (missing alert)
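These are the standard definitions. For concreteness, the counts below are back-solved from the reported overall results (precision 88.8%, recall 71%, accuracy 81% over 100 failing and 100 passing tests), which gives TP = 71, FP = 9, TN = 91, FN = 29:

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Confusion-matrix counts back-solved from the reported RQ1 results.
tp, fp, tn, fn = 71, 9, 91, 29
print(f"precision = {precision(tp, fp):.4f}")         # 0.8875 -> 88.8%
print(f"recall    = {recall(tp, fn):.4f}")            # 0.7100 -> 71%
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.4f}")  # 0.8100 -> 81%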
Study Design: Research Questions
What is the performance of GPT-4 in predicting test results?
● RQ1: All test cases (200 tests)
● RQ2: Test case complexity (100 simple vs. 100 complex)
● RQ3: Test suite (ast, calendar, csv, gzip, and string)
Results
What is the performance of GPT-4 in predicting test results?
RQ1: GPT-4 has a precision of 88.8% and a recall of 71% in the test result prediction. FNs (missing alerts) are more problematic than FPs (wrong alerts).
RQ2: GPT-4 presented better precision and recall when predicting simpler tests than complex ones. Results are still far from 100%, even for simple tests.
RQ3: GPT-4 presented differences among the analyzed test suites, with the precision ranging from 77.8% to 94.7% and the recall between 60% and 90%.
Discussion and Observations
● Correct analysis but incorrect conclusions
○ Correct explanation for a passing or failing test; however, the final verdict was incorrect
● Reliance on comments rather than on test code
○ Comments may be wrong or outdated
● Explanations based on "general knowledge" (rather than on code)
○ In some cases, GPT-4 provided explanations based on "general knowledge" to complement the rationales, instead of relying solely on the source code
Summary
We evaluate the performance of GPT-4 in predicting the execution of 200 test cases of the Python Standard Library
RQ1: GPT-4 presented limited precision (88.8%) and recall (71%) in the test result prediction. FNs (missing alerts) are more problematic than FPs (wrong alerts)
RQ2: GPT-4 presented better precision and recall when predicting simpler tests than complex ones. However, results are still far from 100%, even for simple tests
RQ3: GPT-4 presented large differences in precision and recall among the analyzed test suites