Semantic-Aware Code Model: Elevating the Future of Software Development

Baishakhi Ray · 338 views · 62 slides · Jul 16, 2024

About This Presentation

Keynote: How to build smart models for code generation


Slide Content

1  Semantic-Aware Code Model: Elevating the Future of Software Development
Baishakhi Ray, Columbia University & AWS AI Lab

3  Coding · Code Review/Testing · Debugging · Repairing · Documentation

4  Coding · Code Review/Testing · Debugging · Repairing · Documentation -> AI-Powered Software Engineering!!

Future of Software Development: Coding for Everyone! 5

Background | Problem | Challenge | Solution | Future

6  How to Improve Software Development?
- Improve developers' productivity
- Improve well-being
- Improve the overall software development experience
- Reliability, usability, security, privacy, performance, efficiency, etc.

7  AIWare Code: A Brief Timeline
- 2012-2014: Observed the repetitiveness and statistical predictability of software [Hindle et al., 2012; Allamanis et al., 2013; Barr et al., 2014].
- 2014-2016: Applied basic statistical models (e.g., n-gram) to predict code properties [Allamanis and Sutton, 2014; Ray et al., 2016].
- 2017-2020: Advanced statistical models (e.g., RNNs) adapted to code modeling [Yin and Neubig, 2017; Zhou et al., 2019; Hellendoorn et al., 2020].
- 2020-2022: Transformer-based language models introduced to learn general code representations via large-scale pre-training [Feng et al., 2020; Guo et al., 2021; Wang et al., 2021].
- 2022-now: Large language models (LLMs) with billions of parameters pre-trained on trillions of tokens; products deployed.
- 2024: AI engineers: many AI agents for solving a complex task (SWE-Agent/Devin/AutoDev).

8  AIWare Code: Brief Timeline -> Current Trend
- 2022-2024: LLMs with billions of parameters pre-trained on trillions of tokens; products deployed [CodeLlama, GPTs, StarCoder].
- 2024: AI engineers: many AI agents for solving a complex task (SWE-Agent/Devin/AutoDev).
Current trend: 1. smarter models, 2. prompting with LLMs, 3. AI agents.

9  Benchmarking & Evaluation
Mostly benchmark-driven development.
- Efficiency: how well a model/agent performs on a particular task.
- Benchmarks saturate soon; there is a dire need to create new, challenging benchmarks across different SE tasks.

10  Evaluation Strategies
- Efficiency: how well a model/agent performs on a particular task.
- Cost: the cost of calling LLM APIs.
- Latency: how fast a job needs to be resolved (inline editing vs. test-driven bug fixing).
- Chat setting: human in the loop.

11  AI-Powered Software Development
Amazon Q, Tabnine; 3% of code written by ML. ML-powered program fixing, repair, refactoring, etc. Huge academic contributions: 500+ papers (https://ml4code.github.io/).

12  Today's Talk: Code Generation in the Inline Setting (recap of the slide 7 timeline)

13  How does the community evaluate code generation? It is becoming saturated and probably suffers from data-leakage issues.

14 Do they really understand code?

Duality of Consciousness 15

Duality of Consciousness 16
Natural language: "a sorting program." Programming language: def bubble_sort(): ...
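The slide's snippet mixes Python with C-style braces; a runnable sketch of the sorting program that the natural-language side describes might look like:

```python
def bubble_sort(items):
    """Repeatedly swap adjacent out-of-order elements until sorted."""
    items = list(items)  # work on a copy; leave the caller's list intact
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
    return items

print(bubble_sort([10.5, 8.2, 10.5, 7.1, 8.2]))  # [7.1, 8.2, 8.2, 10.5, 10.5]
```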

17  Lacks Understanding of Program Semantics (GPT-3.5)

Lack of Self-Consistency
Min et al., "Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain." ICLR'24.
LLMs severely lack self-consistency: they do not properly understand even the code and summaries they themselves write.

19  Identifying (Dis)Similar Program Behaviors is Challenging [Ding et al., "Disco." ACL 2022]

Textually similar? Functionally identical? Code models cannot distinguish the inherent malicious functionalities (behaviors).

Buggy version (integer overflow, CVE-2021-38094; FFmpeg libavfilter/vf_convolution.c, commit 3650835):

    static void filter16_roberts(uint8_t *dstp, int width, float scale,
                                 float delta, int peak, ...) {
        uint16_t *dst = (uint16_t *)dstp;
        int x;
        for (x = 0; x < width; x++) {
            int suma = AV_RN16A( ... );
            int sumb = AV_RN16A( ... );
            dst[x] = av_clip(sqrtf(suma*suma + sumb*sumb) * scale + delta, 0, peak);
        }
    }

Fixed version (FFmpeg libavfilter/vf_convolution.c, commit 99f8d32): identical code except that suma and sumb are declared float, so suma*suma no longer overflows.

20  CRUXEval [Gu et al.]: Code Reasoning, Understanding, and eXecution Evaluation
- Generate simple, short Python programs that use some Python libraries.
- Given the input, predict the output (forward reasoning).
- Given the output, predict an input (backward reasoning).
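A CRUXEval-style item can be sketched as follows (a hypothetical example in the benchmark's spirit, not an actual benchmark entry):

```python
# A short, self-contained function of the kind CRUXEval generates:
def f(s):
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    return sorted(counts.items())

# Forward reasoning (output prediction): given the input, predict the output.
assert f("abba") == [("a", 2), ("b", 2)]

# Backward reasoning (input prediction): given the output, supply any input
# that produces it; "baab" satisfies the same output.
assert f("baab") == [("a", 2), ("b", 2)]
```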

21  CRUXEval [Gu et al.]: although the models report decent performance on code generation (on known benchmarks), they struggle to show code understanding.

How can neuro-symbolic reasoning help build "smarter" models?

"Smarter" models: those that are better at CRUX (code reasoning, understanding, and execution).

SemCoder: a CRUX-aware model that is better at code reasoning, understanding, and execution.

25-27  LLM Training Steps

28  SemCoder: A Code Execution-Aware Model
- Filter out non-executable code
- Syntax-aware data augmentation
- Align code execution with source code

SemCoder Data Augmentation & Pre-Processing 29

30-31  Data Augmentation by OSS-Instruct

32  Data Augmentation by OSS-Instruct: out of 43.1k Python solutions, about 11.6k (26.9%) are inexecutable despite instructions to produce "correct" and "self-contained" code.
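A minimal sketch of an executability filter for such generated solutions (an illustration, not the SemCoder pipeline's actual code; a real pipeline would also sandbox and time-limit the run):

```python
def is_executable(src: str) -> bool:
    """Crude executability filter: compile and run the candidate in an
    isolated namespace; any exception marks it inexecutable."""
    try:
        exec(compile(src, "<candidate>", "exec"), {"__name__": "__main__"})
        return True
    except BaseException:
        return False

samples = [
    "def add(a, b): return a + b\nassert add(2, 3) == 5",  # runs cleanly
    "import no_such_module",                               # ImportError
    "print(undefined_name)",                               # NameError
]
print([is_executable(s) for s in samples])  # [True, False, False]
```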

33  Data Augmentation by OSS-Instruct
Random seeds can yield syntactically/semantically incorrect problems.

SemCoder Data Augmentation & Preprocessing 34 Augment with good quality code

35  Impact of Quality Data Augmentation: better performance can be reached with a good-quality but much smaller training dataset.

SemCoder : A Code Execution-Aware Model 36 Filter-out non-executable code Syntax-aware data augmentation Align code execution with source code

37-39  Data Collection Framework for Collecting Execution Data

40-41  SemCoder: Execution Learning

42  Forward Monologue
- Execution coverage
- Natural execution order
- Program state transitions
- Final output
Given the execution, we ask GPT to annotate it with natural-language text explaining the execution.
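The raw state-transition data that a forward monologue narrates can be collected with Python's tracing hook; a minimal sketch (`trace_states` and `running_max` are illustrative names, not SemCoder's collection framework):

```python
import sys

def trace_states(fn, *args):
    """Record (relative line, local variables) after each executed line
    of fn: execution order, state transitions, and the final output."""
    states = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            rel = frame.f_lineno - fn.__code__.co_firstlineno
            states.append((rel, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, states

def running_max(xs):
    best = xs[0]
    for x in xs:
        if x > best:
            best = x
    return best

result, states = trace_states(running_max, [8.2, 10.5, 7.1])
print(result)                      # final output: 10.5
for rel_line, local_vars in states:
    print(rel_line, local_vars)    # natural execution order + state transitions
```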


44  Backward Monologue
- Abstract intermediate constraints: e.g., describe the previous state ([10.5, 8.2, 10.5, 7.1, 8.2]) as "a disordered list with two 10.5s, two 8.2s, and one 7.1."
- Concrete input
Given the execution, we ask GPT to annotate it with natural-language text explaining the input-generating monologue.
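The backward direction can be sketched as concretizing the abstract constraint into one candidate input and verifying it by execution (an illustration built from the slide's own list, not SemCoder's code):

```python
# Abstract constraint from the monologue: "a disordered list with two
# 10.5s, two 8.2s, and one 7.1" whose sorted form is the observed output.
observed_output = [7.1, 8.2, 8.2, 10.5, 10.5]

def satisfies_constraint(candidate):
    # Disordered, and a permutation of the required multiset.
    return (sorted(candidate) == observed_output
            and candidate != observed_output)

# Concretize the abstract description into one valid input...
concrete_input = [10.5, 8.2, 10.5, 7.1, 8.2]
assert satisfies_constraint(concrete_input)

# ...and verify it by actually executing the program under study.
assert sorted(concrete_input) == observed_output
```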

45  Monologue Annotation with LLMs
- Generated by an LLM (GPT-3.5-turbo) with rejection sampling.
- Use in-context learning to show the kind of reasoning we expect.
- To verify a monologue, ask GPT to follow the step-by-step reasoning to generate the input (backward reasoning) and the output (forward reasoning).
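The rejection-sampling verification can be sketched as a single loop; `propose` and `replay` are hypothetical stand-ins for the GPT-3.5-turbo calls, not a real API:

```python
def generate_verified_monologue(program_src, test_input,
                                propose, replay, max_tries=5):
    """Rejection sampling over candidate monologues: a candidate survives
    only if replaying its step-by-step reasoning reproduces the output
    obtained by actually executing the program."""
    namespace = {}
    exec(program_src, namespace)               # ground truth via execution
    true_output = namespace["f"](test_input)

    for _ in range(max_tries):
        monologue = propose(program_src, test_input)    # candidate reasoning
        if replay(monologue, test_input) == true_output:
            return monologue                   # accept: reasoning checks out
    return None                                # reject every candidate

# Toy stand-ins for the LLM calls, just to exercise the loop:
src = "def f(x):\n    return sorted(x)"
monologue = generate_verified_monologue(
    src, [10.5, 7.1, 8.2],
    propose=lambda s, i: "sort the list ascending",
    replay=lambda m, i: sorted(i))
print(monologue)  # "sort the list ascending"
```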

46  SemCoder: A Code Execution-Aware Model
- Filter out non-executable code
- Syntax-aware data augmentation
- Align code execution with source code
- Feedback helps to debug

47-48  Train a Code LM to Self-Refine
Step 1: Collect the model's faults.
Ding et al., 2024. "CYCLE: Learning to Self-Refine Code Generation." OOPSLA'24.

49  Train a Code LM to Self-Refine
Step 1: Collect the model's faults. Step 2: Learn to refine.
Ding et al., 2024. "CYCLE: Learning to Self-Refine Code Generation." OOPSLA'24.

50  CYCLE: Train a Code LM to Self-Refine
Step 1: Collect the model's faults. Step 2: Learn to refine. Step 3: Iterative self-refinement.
Ding et al., 2024. "CYCLE: Learning to Self-Refine Code Generation." OOPSLA'24.
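The three steps can be sketched as one loop (an illustrative sketch; `model`, `toy_model`, and `toy_tests` are hypothetical names, not CYCLE's actual interface):

```python
def iterative_self_refine(model, problem, run_tests, max_iters=3):
    """Generate code, execute the tests, and feed the failure report
    back to the model for another round of refinement."""
    code = model(problem)                      # initial generation
    for _ in range(max_iters):
        failure = run_tests(code)              # execution feedback
        if failure is None:
            return code                        # all tests pass: accept
        # Refine conditioned on the faulty code plus the test feedback.
        code = model(problem, faulty_code=code, feedback=failure)
    return code                                # best effort after max_iters

# Toy model: the first attempt is off by one; refinement fixes it.
def toy_model(problem, faulty_code=None, feedback=None):
    if feedback:
        return "def double(x):\n    return x * 2"
    return "def double(x):\n    return x + x + 1"

def toy_tests(code):
    ns = {}
    exec(code, ns)
    return None if ns["double"](3) == 6 else "double(3) != 6"

print(iterative_self_refine(toy_model, "write double(x)", toy_tests))
```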

51  CYCLE: Still Generates, and Refines Better
Ding et al., 2024. "CYCLE: Learning to Self-Refine Code Generation." OOPSLA'24.
- Past-generation mask: mitigates exact copying.
- Mixture of data (self-refine samples + NL-to-code samples): learning to both generate and refine.

52-53  Overall Results of SemCoder

54  The SemCoder 6.7B model beats some larger open-source models by a margin.

55  With scaling, the improvement continues!

Model                                          HumanEval (+)   MBPP (+)      CRUXEval-I    CRUXEval-O
SemCoder (based on deepseek-coder-6.7b-base)   68.3 (62.2)     79.9 (65.9)   51.2 / 52.6   48.1 / 56.6
DeepSeek-Coder-16B                             49.4 (42.7)     78.3 (65.1)   39.4 / 51.4   44.1 / 46.4
SemCoder (based on DeepSeek-Coder-16B)         74.4 (67.1)     81.0 (68.3)   49.4 / 55.2   49.1 / 58.5

Impact of Monologue Design Choice 56

57  Increasing Accuracy with Refinement Steps

58  Feedback & refinement show great potential (especially for agent-centric systems).
Ding et al., 2024. "CYCLE: Learning to Self-Refine Code Generation." OOPSLA'24.

Neuro-symbolic reasoning (better data + execution & abstract reasoning) shows great potential for building smarter yet smaller models.

Neuro-symbolic reasoning (better data + execution & abstract reasoning) shows great potential for building smarter yet smaller models. It can be used in more involved SE tasks:
- Input/output prediction
- Program repair
- Vulnerability prediction
- Branch prediction
- Semantic clone detection

Background | Problem | Challenge | Solution | Future

61  Semantic-Aware Code Model: The Future!
- Formal analysis: guarantees for the analysis, but noise-intolerant.
- Probabilistic models: scalable and transferable, but no theoretical guarantees.
- Code generation: improve trust, developer-feedback-oriented automation, explainability.

62  Acknowledgements: Ira Ceka, Robin Ding, Marcus Min, Alex Mathai, Vikram Nitin, Jinjun Peng, and others.