[SEMINAR] Generative AI in Software Engineering: Ongoing research and emerging issues
Speaker: Phuong Nguyen, Associate Professor
When: Wednesday, 19th November, 14:30-15:30
Where: Alan Turing Seminar Room
Abstract: The proliferation of disruptive large language models (LLMs) in recent years has enabled a plethora of applications across several domains. In software engineering (SE), LLMs have shown remarkable capability in understanding and generating software artifacts. In our research group, for instance, we have applied LLMs to multiple SE tasks, such as detecting malicious code and summarizing software documentation. Despite their promising performance, LLMs still face several challenges. Among them, hallucination remains a major concern, i.e., generated outputs may be erroneous or nonsensical, undermining the system’s reliability. Furthermore, as LLMs increasingly rely on external sources, such as retrieval-augmented generation (RAG) knowledge bases and fine-tuning data, they become more susceptible to adversarial attacks disguised in that data.
Trained on vast datasets, including public code repositories and security databases, LLMs can effectively reproduce patterns from their pretraining data. While this exposure enhances model utility, it also introduces a critical risk: data memorization, i.e., the verbatim or near-verbatim reproduction of training examples during inference. Memorization in LLMs is not merely a theoretical issue; it poses concrete security and privacy risks, for example when a model regurgitates credentials or proprietary code seen during training. It is therefore crucial to investigate whether LLMs perform well on downstream tasks through genuine generalization, or simply because they have been exposed to identical examples during training.
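To make the memorization question concrete, the Python sketch below (a simplified illustration only, not the speaker's actual method) probes for verbatim memorization: it feeds a model the first half of a candidate snippet and measures how closely the greedy continuation matches the true second half. The "gpt2" checkpoint and the example snippet are placeholders.

from difflib import SequenceMatcher
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def memorization_score(snippet: str, prefix_ratio: float = 0.5) -> float:
    # Split the snippet into a prompt prefix and a held-out true suffix.
    ids = tokenizer(snippet, return_tensors="pt").input_ids[0]
    split = int(len(ids) * prefix_ratio)
    prefix, true_suffix = ids[:split], ids[split:]
    # Greedy decoding is deterministic, so a match is not sampling luck.
    output = model.generate(
        prefix.unsqueeze(0),
        max_new_tokens=len(true_suffix),
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    generated_suffix = output[0][split:]
    # A score near 1.0 indicates verbatim or near-verbatim reproduction.
    return SequenceMatcher(
        None,
        tokenizer.decode(true_suffix),
        tokenizer.decode(generated_suffix),
    ).ratio()

# Hypothetical usage: a snippet suspected to appear in public repositories.
print(memorization_score("def add(a, b):\n    return a + b"))

High scores across many held-out training snippets would point to memorization rather than generalization; low scores on snippets known to be in the training data suggest the model is not simply replaying them.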
The agenda of the seminar is as follows: (i) a brief introduction to our current research on the applications of generative AI in software engineering; and (ii) a discussion of ongoing research topics, including our recent work on probing memorization in LLMs.
