Future of Software Development: Coding for Everyone!
Background | Problem | Challenge | Solution | Future

How to Improve Software Development?
- Improve developers' productivity
- Improve well-being
- Improve the overall software development experience
- Reliability, usability, security, privacy, performance, efficiency, etc.
AIWare Code: Brief Timeline
- 2012-2014: Observed the repetitiveness and statistical predictability of software. [Hindle et al., 2012; Allamanis et al., 2013; Barr et al., 2014]
- 2014-2016: Applied basic statistical models (e.g., n-gram) to predict code properties. [Allamanis and Sutton, 2014; Ray et al., 2016]
- 2017-2020: Advanced statistical models (e.g., RNNs) adapted to code modeling. [Yin and Neubig, 2017; Zhou et al., 2019; Hellendoorn et al., 2020]
- 2020-2022: Transformer-based language models introduced to learn general code representations via large-scale pre-training. [Feng et al., 2020; Guo et al., 2021; Wang et al., 2021]
- 2022-Now: Large Language Models (LLMs) with billions of parameters are pre-trained on trillions of tokens; products deployed.
- 2024: AI engineers: many AI agents for solving a complex task (SWE-Agent / Devin / AutoDev).
AIWare Code: Brief Timeline → Current Trend
- 2022-2024: LLMs with billions of parameters pre-trained on trillions of tokens; products deployed [CodeLlama, GPTs, StarCoder].
- 2024: AI engineers: many AI agents for solving a complex task (SWE-Agent / Devin / AutoDev).
Current trend: (1) smarter models, (2) prompting with LLMs, (3) AI agents.
Benchmarking & Evaluation
- Mostly benchmark-driven development
- Efficiency: how well a model/agent performs on a particular task
- Benchmarks saturate soon; dire need to create new, challenging benchmarks across different SE tasks
Evaluation Strategies
- Efficiency: how well a model/agent performs on a particular task
- Cost: cost of calling LLM APIs
- Latency: how fast a job needs to be resolved (inline editing vs. test-driven bug fixing)
- Chat setting: human in the loop
AI-Powered Software Development
- Industry products: Amazon Q, Tabnine; 3% of code written by ML; ML-powered program fixing, repair, refactoring, etc.
- Huge academic contributions: 500+ papers (https://ml4code.github.io/)
Today's Talk: Code Generation for the Inline Setting
How does the community evaluate code generation tasks? Existing benchmarks are becoming saturated and probably suffer from data-leakage issues.
Do they really understand code?
Duality of Consciousness
- Natural language: "a sorting program"
- Programming language: def bubble_sort(): ...
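The duality can be seen in a single tiny program: one side is the natural-language description "a sorting program", the other is its programming-language realization. A minimal sketch (the function name follows the slide; the body is an assumed standard bubble sort):

```python
# "A sorting program" in natural language; bubble_sort in code.
def bubble_sort(xs):
    xs = list(xs)  # sort a copy, leave the input untouched
    for end in range(len(xs) - 1, 0, -1):
        for i in range(end):
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]  # swap out-of-order pair
    return xs

print(bubble_sort([5, 2, 9, 1]))  # → [1, 2, 5, 9]
```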
Lack of Understanding of Program Semantics (GPT-3.5)
Lack of Self-Consistency
Min et al., "Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain." ICLR'24.
LLMs severely lack self-consistency: they do not even properly understand the code and summaries they write themselves.
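The IdentityChain-style check above can be sketched as a round trip: code → summary → code, judged by behavior rather than text. In this sketch, `summarize` and `generate` are hypothetical stand-ins for LLM calls; a self-consistent model should preserve the program's behavior through the round trip.

```python
# Sketch of an IdentityChain-style self-consistency loop. `summarize` and
# `generate` are hypothetical stand-ins for LLM calls.

def summarize(code):
    # Stand-in "model": describe the function in natural language.
    return "a function f(a, b) that returns the larger of two numbers"

def generate(summary):
    # Stand-in "model": regenerate code from the summary.
    return "def f(a, b):\n    return a if a > b else b"

def behaviorally_equal(code1, code2, tests):
    # Judge self-consistency by I/O behavior, not by textual match.
    ns1, ns2 = {}, {}
    exec(code1, ns1)
    exec(code2, ns2)
    return all(ns1["f"](*t) == ns2["f"](*t) for t in tests)

original = "def f(a, b):\n    return max(a, b)"
regenerated = generate(summarize(original))
print(behaviorally_equal(original, regenerated, [(1, 2), (5, 3), (0, 0)]))  # → True
```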
Identifying (Dis)Similar Program Behaviors is Challenging (Ding et al., "Disco." ACL 2022)

Two textually similar versions of FFmpeg's filter16_roberts (libavfilter/vf_convolution.c) behave very differently. The vulnerable version (commit 3650835) declares the accumulators as int, causing an integer overflow (CVE-2021-38094):

    static void filter16_roberts(uint8_t *dstp, int width, float scale,
                                 float delta, int peak, ...) {
        uint16_t *dst = (uint16_t *)dstp;
        int x;
        for (x = 0; x < width; x++) {
            int suma = AV_RN16A( ... );  /* int: may overflow */
            int sumb = AV_RN16A( ... );
            dst[x] = av_clip(sqrtf(suma*suma + sumb*sumb) * scale + delta, 0, peak);
        }
    }

The patched version (commit 99f8d32) is identical except that suma and sumb are declared float. Code-text models cannot distinguish such inherently malicious functionalities (behaviors): textually similar does not mean functionally identical.
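The same textual-vs-behavioral gap can be reproduced in miniature. A hypothetical Python analogue (function names and numbers invented for illustration): a tiny edit that textual-similarity metrics barely notice, yet that changes behavior once an intermediate value wraps around, mimicking the int-vs-float overflow above.

```python
# Two near-identical snippets: the "buggy" one simulates 32-bit wraparound
# with a modulo, the "fixed" one does not. Purely illustrative.
import difflib

buggy = "def clip_sum(a, b):\n    return min((a * a + b * b) % 2**31, 2**15)\n"
fixed = "def clip_sum(a, b):\n    return min((a * a + b * b), 2**15)\n"

# Textual similarity is very high ...
ratio = difflib.SequenceMatcher(None, buggy, fixed).ratio()
print(ratio > 0.85)  # → True

# ... but behavior diverges on large inputs, where the intermediate wraps.
ns_b, ns_f = {}, {}
exec(buggy, ns_b)
exec(fixed, ns_f)
print(ns_b["clip_sum"](3, 4) == ns_f["clip_sum"](3, 4))          # → True (small inputs agree)
print(ns_b["clip_sum"](2**16, 0) == ns_f["clip_sum"](2**16, 0))  # → False (large inputs differ)
```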
CRUXEval [Gu et al.]: Code Reasoning, Understanding, and eXecution Evaluation
- Generates simple, short Python programs that use some Python libraries
- Given the input, predict the output (forward reasoning)
- Given the output, predict an input (backward reasoning)
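A CRUXEval-style task looks like this (a made-up example in the benchmark's spirit, not taken from it): the model sees the function plus one side of an I/O pair and must predict the other side.

```python
# The function under test.
def f(s):
    return s.replace("a", "b").upper()

# Forward reasoning: given the input "banana", predict the output.
assert f("banana") == "BBNBNB"

# Backward reasoning: given the output "CBT", find an input producing it.
assert f("cat") == "CBT"
```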
CRUXEval [Gu et al.]: although the models report decent performance on code generation (on known benchmarks), they struggle to demonstrate code understanding.
How can neuro-symbolic reasoning help build "smarter" models?
"Smarter" models are those that are better at CRUX: code reasoning, understanding & execution.
SemCoder: a CRUX-aware model that is better at code reasoning, understanding & execution.
LLM Training Steps
SemCoder: A Code Execution-Aware Model
- Filter out non-executable code
- Syntax-aware data augmentation
- Align code execution with source code
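The first step, filtering out non-executable code, can be sketched as follows. This is an assumed design, not SemCoder's actual pipeline: each sample is treated as a self-contained Python snippet, and we keep only snippets that both compile and run to completion within a time budget.

```python
# A minimal executability filter for training samples (assumed design).
import subprocess
import sys

def is_executable(snippet, timeout_s=5.0):
    try:
        compile(snippet, "<sample>", "exec")  # cheap syntax check first
        result = subprocess.run([sys.executable, "-c", snippet],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0         # runtime errors -> nonzero
    except (SyntaxError, subprocess.TimeoutExpired):
        return False

samples = [
    "print(sum(range(10)))",  # runs fine -> keep
    "def f(:",                # syntax error -> drop
    "1 / 0",                  # raises at runtime -> drop
]
kept = [s for s in samples if is_executable(s)]
print(len(kept))  # → 1
```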
SemCoder Data Augmentation & Pre-Processing
Data Augmentation by OSS-Instruct
Out of 43.1k Python solutions, about 11.6k (26.9%) are non-executable, despite instructions to produce "correct" and "self-contained" code.
Random seeds can yield syntactically or semantically incorrect problems.
SemCoder Data Augmentation & Preprocessing: augment with good-quality code.
Impact of Quality Data Augmentation: better performance can be reached with a good-quality but much smaller training dataset.
Framework for Collecting Execution Data
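One way to collect such execution data in Python is via a tracing hook. This is an assumed design, not SemCoder's actual tracer: `sys.settrace` records each executed line together with the local variables at that point, yielding the program-state transitions used for execution-aware training.

```python
# Collect (line number, local variables) transitions for one call.
import sys

def trace_execution(fn, *args):
    events = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer  # keep tracing inside this frame
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, events

def running_max(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

result, trace = trace_execution(running_max, [3, 1, 4])
print(result)          # → 4
print(len(trace) > 0)  # → True
```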
SemCoder: Execution Learning
Forward Monologue
- Execution coverage
- Natural execution order
- Program state transitions
- Final output
Given the execution, we ask GPT to annotate it with natural-language text explaining the execution.
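A forward monologue in the spirit of the slide looks like this (an illustrative example, not taken from SemCoder's data): the comments narrate coverage, the execution order, the state transitions, and the final output.

```python
def double_evens(xs):
    out = []
    for x in xs:
        if x % 2 == 0:
            out.append(x * 2)
    return out

# Forward monologue for double_evens([1, 2, 3, 4]):
#   out starts as [].
#   x = 1: 1 % 2 != 0, the if-branch is skipped, out stays [].
#   x = 2: 2 % 2 == 0, append 4, out becomes [4].
#   x = 3: branch skipped, out stays [4].
#   x = 4: append 8, out becomes [4, 8].
#   The loop ends and the function returns [4, 8].
print(double_evens([1, 2, 3, 4]))  # → [4, 8]
```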
Backward Monologue
- Abstract intermediate constraints: e.g., describe the previous state ([10.5, 8.2, 10.5, 7.1, 8.2]) as "a disordered list with two 10.5s, two 8.2s, and one 7.1"
- Then concretize an input that satisfies the constraints
Given the execution, we ask GPT to annotate it with natural-language text explaining the generated monologue.
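A backward monologue can be sketched as follows (illustrative; the list reuses the slide's example values): reason from a desired output back to abstract constraints on the input, then pick a concrete input satisfying them.

```python
def dedupe_sorted(xs):
    return sorted(set(xs))

# Backward monologue for target output [7.1, 8.2, 10.5]:
#   Abstract constraint: the input must contain exactly the distinct values
#   7.1, 8.2, and 10.5, in any order and with any duplicates.
#   Concrete input satisfying the constraint:
candidate = [10.5, 8.2, 10.5, 7.1, 8.2]
print(dedupe_sorted(candidate))  # → [7.1, 8.2, 10.5]
```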
Monologue Annotation with LLMs
- Generated by an LLM (GPT-3.5-turbo) with rejection sampling
- Use in-context learning to show the kind of reasoning we expect
- To verify a monologue, ask GPT to follow the step-by-step reasoning to generate the input (backward reasoning) and the output (forward reasoning)
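The rejection-sampling loop above can be sketched like this. Everything model-related is a stand-in: `ask_llm` is a hypothetical, deliberately noisy function in place of the real GPT-3.5-turbo call; the verification step keeps a sampled monologue only when replaying its reasoning reproduces the ground truth.

```python
# Rejection sampling for monologue annotation (assumed design).
import random

def ask_llm(prompt, seed):
    # Stand-in for an LLM call: most "samples" are wrong on purpose.
    random.seed(seed)
    return "4" if random.random() < 0.3 else "wrong"

def annotate_with_rejection(prompt, expected, max_tries=20):
    for seed in range(max_tries):
        monologue = ask_llm(prompt, seed)
        if monologue == expected:  # verification: replayed answer agrees
            return monologue
    return None  # give up: no sample survived verification

print(annotate_with_rejection("trace f(2) = ?", "4"))  # → 4
```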
SemCoder: A Code Execution-Aware Model
- Filter out non-executable code
- Syntax-aware data augmentation
- Align code execution with source code
- Feedback helps to debug
CYCLE: Train a Code LM to Self-Refine (Ding et al., 2024. CYCLE: Learning to Self-Refine Code Generation. OOPSLA'24.)
- Step 1: Collect the model's faults
- Step 2: Learning to refine
- Step 3: Iterative self-refinement
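The three steps above can be sketched as a loop. This is illustrative only: `model_generate` is a hypothetical stand-in for the fine-tuned LM, and the real CYCLE system trains on collected faults rather than hard-coding them. The loop generates code, runs the tests, and feeds the failure message back as refinement context until the tests pass.

```python
def model_generate(problem, feedback=None):
    # Stand-in model: the first attempt is buggy; with feedback it is fixed.
    if feedback is None:
        return "def add(a, b):\n    return a - b"  # faulty first draft
    return "def add(a, b):\n    return a + b"      # refined draft

def run_tests(code):
    ns = {}
    exec(code, ns)
    got = ns["add"](2, 3)
    if got != 5:
        return "add(2, 3) returned {}, expected 5".format(got)
    return None  # all tests pass

def self_refine(problem, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        code = model_generate(problem, feedback)
        feedback = run_tests(code)      # step 1: collect the fault
        if feedback is None:            # step 3: stop once tests pass
            return code
    return code

final = self_refine("write add(a, b)")
print(run_tests(final) is None)  # → True
```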
CYCLE: Still Generates Well & Refines Better (Ding et al., 2024. CYCLE: Learning to Self-Refine Code Generation. OOPSLA'24.)
- Past-generation mask: mitigates exact copying of the faulty attempt
- Mixture of data (self-refine samples + NL-to-code samples): learning to both generate and refine
Overall Results of SemCoder
The SemCoder 6.7B model beats some larger open-source models by a margin.
Feedback & refinement show great potential, especially for agent-centric systems. (Ding et al., 2024. CYCLE: Learning to Self-Refine Code Generation. OOPSLA'24.)
Neuro-symbolic reasoning (better data + execution & abstract reasoning) shows great potential for building smarter yet smaller models.
It can be used in more involved SE tasks: (i) input/output prediction, (ii) program repair, (iii) vulnerability prediction, (iv) branch prediction, (v) semantic clone detection.
Semantic-Aware Code Models: The Future!
Background | Problem | Challenge | Solution | Future
- Formal analysis: guarantees for the analysis, but noise-intolerant
- Probabilistic models: scalable and transferable, but no theoretical guarantees
- Code generation going forward: improve trust, developer-feedback-oriented automation, explainability
Acknowledgements: Ira Ceka, Robin Ding, Marcus Min, Alex Mathai, Vikram Nitin, Jinjun Peng, and others.