The analyzer and planner play pivotal roles in self-adaptation. The main tasks of the analyzer are exploring possible configurations for adaptation (i.e., adaptation options) and evaluating them, while the main tasks of the planner are selecting the best adaptation option based on the adaptation goals and generating a plan to adapt the managed system for this new configuration [
Weyns, 2020]. However, it is often not easy to distinguish between these roles, as the functions of the analyzer and the planner may be integrated (often referred to as decision-making). Hence, we deal with them together. We start by describing how LLMs have the potential to enhance different aspects of engineering SASs, following the “seven waves” of research interests within the research community [
Weyns, 2020].
This discussion extends from
Section 4.2.1 through
Section 4.2.5. Next, we examine how LLMs have the potential to augment current planning that is generally used in SASs in
Section 4.2.6. Finally, we introduce two new planning paradigms for SASs that leverage the direct use of LLMs and diffusion models as planners respectively. These paradigms are described in
Sections 4.2.7 and
4.2.8.
4.2.1 Architecture-Based Adaptation.
Architecture-based adaptation centers on leveraging software architecture to realize self-adaptation, reflected in two complementary functions. First, architecture allows abstracting the design of SASs through layers and system components. A seminal model in this approach is the three-layer reference model of Kramer and Magee [
Sykes et al., 2008], which delineates the system’s operations across three layers: (a) the goal management layer, responsible for generating action plans; (b) the change management layer, tasked with configuring components per these plans; and (c) the component layer, handling the operations of these components.
Formal Model for Self-Adaptation (FORMS) formalizes this structure [
Weyns et al., 2012b]. Second, architecture enables the system to exploit high-level models to reason about the adaptation options, potentially system-wide. Characteristic works in this area over time include Rainbow [
Garlan et al., 2004], Models at Runtime [
Blair et al., 2009], QoSMOS [
Calinescu et al., 2011], proactive adaptation [
Moreno et al., 2015], and ActivFORMS [
Iftikhar and Weyns, 2014].
Recent developments in LLMs reflect similar principles and can be viewed as CoTs with external tool calls [
Inaba et al., 2023]. Given a specific problem or goal, an LLM first segments it into sub-problems, either sequentially or hierarchically, then selects appropriate components, often APIs, to address each sub-problem, and finally deploys and calls these components. For instance, HuggingGPT [
Shen et al., 2023] uses LLMs as controllers to orchestrate existing AI models with language interfaces. HuggingGPT selects AI models based on their functional descriptions from Hugging Face and employs them for executing complex tasks across language, vision, and speech domains, demonstrating robust performance. Another example, ToolLLM [
Qin et al., 2024] tackles problem-solving by generating sequences of API calls from a pool of 16,464 real-world APIs.
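To make this control loop concrete, the following minimal Python sketch illustrates the HuggingGPT/ToolLLM-style pattern of task decomposition, tool selection, and result synthesis. The `llm_complete` helper and the toy tool registry are hypothetical placeholders, not the actual APIs of those systems.

```python
# Minimal sketch of an LLM-as-controller loop (HuggingGPT/ToolLLM style).
# `llm_complete` and the tool registry below are hypothetical placeholders.
import json

TOOLS = {
    "image_caption": lambda image_path: f"caption for {image_path}",   # stand-in vision model
    "translate_en_de": lambda text: f"German translation of: {text}",  # stand-in language model
}

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to any chat/completion endpoint."""
    raise NotImplementedError

def plan_and_execute(task: str) -> str:
    # 1. Task decomposition: ask the LLM to split the task into tool calls.
    plan_prompt = (
        f"Task: {task}\n"
        f"Available tools: {list(TOOLS)}\n"
        'Return a JSON list of steps, each {"tool": ..., "arg": ...}.'
    )
    steps = json.loads(llm_complete(plan_prompt))
    # 2. Component selection and execution: call each selected tool in order.
    results = [TOOLS[step["tool"]](step["arg"]) for step in steps]
    # 3. Response synthesis: let the LLM fuse the intermediate results into an answer.
    return llm_complete(f"Task: {task}\nIntermediate results: {results}\nFinal answer:")
```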
Furthermore, some studies focus specifically on the aspect of component selection, akin to the change management layer.
Schick et al. [2023] introduces Toolformer, a model trained explicitly to determine which APIs to use, the timing of their invocation, and the parameters to be passed.
Zohar et al. [2023] introduces Language-Only Vision Model Selection, which facilitates model selection and performance prediction based solely on textual descriptions of the application. Lastly,
Alsayed et al. [2024] proposes MicroRec, a framework designed to recommend or select microservices using information from README files and Dockerfiles.
Zhuang et al. [2024] considers the API call space as a decision tree, where nodes represent API function calls and their cost functions, and uses the A* algorithm to achieve efficient call paths.
4.2.2 Requirements-Driven Adaptation.
Requirement-driven adaptation puts the emphasis on the requirements as the driver of adaptation, treating them as first-class citizens. Notable methods include RELAX, a language that facilitates the relaxation of requirements to address uncertainties [
Whittle et al., 2009], and awareness and evolution requirements reified in the ZANSHIN framework, which introduced meta-requirements for determining adaptation and its actual execution respectively [
Silva Souza et al., 2011]. We explore the potential of GenAI through three key aspects of requirement management: specification, operationalization, and change.
Requirement Specification. Specifying requirements involves defining the objectives that the system should fulfill. Central to self-adaptation are quality requirements [
Weyns et al., 2012a]. In this context, LLMs may significantly alleviate the modeling burden. For example, LLMs have been used to convert requirements expressed in natural language into formal specification languages such as
Linear Temporal Logic (LTL) or a user-given domain-specific model language, as demonstrated in
Izquierdo et al. [2024] and
Yang et al. [2024b].
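As an illustration of this modeling support, the sketch below prompts an LLM with a few examples to translate a natural-language quality requirement into an LTL formula and applies a cheap well-formedness check before the formula is handed to a verifier. The `llm_complete` helper and the few-shot examples are illustrative assumptions, not taken from the cited works.

```python
# Hedged sketch: LLM-based translation of a natural-language requirement into LTL.
# `llm_complete` is a hypothetical completion helper; the examples are illustrative.
import re

FEW_SHOT = """\
Requirement: "Every request must eventually be answered."
LTL: G (request -> F response)

Requirement: "The system never enters an unsafe state."
LTL: G (!unsafe)
"""

def requirement_to_ltl(requirement: str, llm_complete) -> str:
    prompt = f'{FEW_SHOT}\nRequirement: "{requirement}"\nLTL:'
    formula = llm_complete(prompt).strip()
    # Cheap syntactic sanity check before handing the formula to a model checker.
    if not re.fullmatch(r"[A-Za-z0-9_!&|()\->UFGX<> ]+", formula):
        raise ValueError(f"LLM output is not a recognisable LTL string: {formula}")
    return formula
```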
Requirement Operationalization and Traceability. This aspect refers to aligning or synchronizing system elements with dynamic requirements, which is essential in requirement-driven adaptation [
Sawyer et al., 2010]. As the traceability between high-level goals and components has been discussed in architecture-based adaptation, here we discuss the linking within requirements and the linking from requirements to the code level. For linking within requirements,
Preda et al. [2024] applies LLMs to the task of high-level to low-level requirements coverage reviewing, demonstrating the LLM’s strong ability to map between high-level abstract requirements and low-level scenario-specific requirements. For linking to the code level, T-BERT [
Lin et al., 2021] effectively creates trace links between source code and natural language artifacts, achieving F1 scores between 0.71 and 0.93 across various datasets. Similarly, BERT4RE [
Ajagbe and Zhao, 2022] fine-tunes BERT to support establishing requirements traceability links for a wide range of requirements.
Requirement Change. Requirement change is a crucial aspect of an adaptive system’s capability to modify its objectives based on changes, particularly in the environmental context, representing a significant challenge within requirement-driven adaptation [
Weyns, 2020]. LLMs have shown promising potential in addressing this challenge from three perspectives. Firstly, LLMs have been extensively utilized in RL, particularly in dynamic and complex environments, with a focus on reward design and reward shaping, and LLM-generated rewards have been shown to surpass manually designed ones. For instance,
Kwon et al. [2023] validates the consistency between LLM-generated rewards and user’s objectives under zero-shot or few-shot conditions.
Xie et al. [2024] emphasizes generating dense reward functions based on natural language descriptions of system goals and environmental representations. These ideas can be directly applied to dynamic requirement adjustments in adaptive systems. Secondly, requirements extraction and analysis often require inputs from multiple perspectives, including end-users, engineers, and domain experts. To address this, Nakagawa and Honiden [2023] proposed a multi-LLM agent framework that enables LLM agents to assume various roles and iteratively refine system requirements through discussions. Originally designed for the requirements engineering phase, this framework is equally applicable to runtime requirements adaptations by equipping agents with up-to-date runtime context. Moreover, in situations involving requirements conflicts, negotiation or debate-based approaches [
Chan et al., 2024a;
Hunt et al., 2024] have been shown to be potentially more effective than traditional discussion methods. Finally, leveraging LLMs’ capabilities for natural language interaction allows them to effectively capture user preferences from runtime feedback and integrate them into the system’s requirements. This aspect is not discussed here but is covered in detail in
Section 5.1.
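As a minimal illustration of the reward-design idea discussed above (cf. Kwon et al. [2023]), the sketch below lets an LLM score how well an observed episode matches a natural-language goal and uses that score as a reward signal. The `llm_complete` helper, the prompt format, and the clipping are illustrative assumptions.

```python
# Hedged sketch of LLM-assisted reward design: the LLM judges goal attainment from a
# natural-language goal description. `llm_complete` is a hypothetical helper.
def llm_reward(goal: str, episode_summary: str, llm_complete) -> float:
    prompt = (
        f"Goal: {goal}\n"
        f"Observed behaviour: {episode_summary}\n"
        "On a scale from 0 (goal not met) to 1 (goal fully met), answer with a single number:"
    )
    try:
        return max(0.0, min(1.0, float(llm_complete(prompt).strip())))
    except ValueError:
        return 0.0  # be conservative when the LLM output is not parseable

# Usage: at the end of an episode, the managing system could call
#   r = llm_reward("serve requests within 100 ms", summary, llm_complete)
# and feed r to the learner in place of a hand-designed reward.
```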
4.2.3 Guarantees under Uncertainty.
“Guarantees Under Uncertainty” focuses on ensuring that an SAS complies with its adaptation goals despite the inherent uncertainties it faces. Formal verification techniques such as quantitative verification [
Calinescu et al., 2011], statistical model checking [
Weyns and Iftikhar, 2023], and proactive adaptation using probabilistic model checking [
Moreno et al., 2016] have been extensively studied for their ability to provide evidence that the system complies with its requirements at runtime.
To the best of our knowledge, there is no research on using LLMs to directly enhance verification processes. However, several studies demonstrate how LLMs can automate or assist the modeling activities for model checking, potentially lowering entry barriers for developers. For instance, Yang and Wang [2024] employs LLMs to convert natural language network protocol descriptions into quantifiable dependency graphs and formal models, aiding the formal verification of next-generation network protocols. Other studies aim to convert natural language into LTL specifications [
Izquierdo et al., 2024;
Mavrogiannis et al., 2024;
Yang et al., 2024b].
Furthermore, the use of LLMs in theorem proving (in the context of both mathematics and programs) has also seen initial efforts.
Welleck et al. [2022] fine-tunes GPT-3 for mathematical proof generation, reporting a correctness rate of about 40% on short proofs (2–6 steps).
Han et al. [2022] extracts training data from kernel-level proofs to improve the Transformer’s (next-step) tactic prediction, addressing the scarcity of training data for formal theorem proving. Thor [
Jiang et al., 2022] allows a language model-based theorem prover to additionally call automated theorem provers (namely hammers [
Czajka and Kaliszyk, 2018]) for premise selection, achieving performance comparable to existing SOTA while reducing computational demand.
First et al. [2023] proposes Baldur, a fine-tuned LLM for generating entire proofs, which proves to be as effective as search-based techniques but without the associated high costs. Baldur has also demonstrated capabilities in proof repair by utilizing additional context from previous failed attempts and error messages, proving an additional 8.7% of theorems compared to Thor. Additionally, [
Wu et al., 2022a;
Zhou et al., 2024c] attempt to automatically translate mathematical problems into formal specifications in systems such as Isabelle (a formal theorem-proving environment). Regarding program verification, LEMUR [
Wu et al., 2023] combines LLMs and automated reasoners, where LLMs are employed to propose program invariants in the form of sub-goals, and then reasoners are used to verify their Boolean properties.
Yao et al. [2023b] explore the use of LLMs to synthesize invariants and other proof structures necessary for demonstrating program correctness within the Verus framework, significantly reducing the effort required to (manually) write entry-level proof code.
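The division of labour in LEMUR-style approaches can be sketched with a toy example: an LLM proposes a candidate loop invariant, and an automated reasoner (here Z3) checks whether it is inductive for the loop `x := x + 1` starting from `x == 0`. The `propose_invariant` callback and the encoding are illustrative assumptions, not LEMUR's actual interface.

```python
# Hedged sketch: LLM proposes an invariant, an SMT solver (Z3) checks inductiveness.
from z3 import Int, Solver, And, Not, Implies, unsat

def is_inductive(inv):
    """inv maps a Z3 integer variable to a Z3 Boolean expression, e.g. lambda x: x >= 0."""
    x, x_next = Int("x"), Int("x_next")
    s = Solver()
    # The invariant must hold initially and be preserved by the loop body x_next = x + 1.
    s.add(Not(And(Implies(x == 0, inv(x)),
                  Implies(And(inv(x), x_next == x + 1), inv(x_next)))))
    return s.check() == unsat  # no counterexample found => the invariant is inductive

def verify_with_llm(propose_invariant):
    # `propose_invariant` is a hypothetical LLM call returning a candidate like lambda x: x >= 0.
    candidate = propose_invariant("loop: x starts at 0 and is incremented by 1")
    return is_inductive(candidate)  # on failure, one would re-prompt with the counterexample
```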
4.2.5 Learning from Experience.
Learning from experience in SASs refers to the use of
Machine Learning (ML) techniques to manage the growing scale and increasing complexity of uncertainty [
Gheibi et al., 2021a]. A representative example is reducing large search or adaptation spaces, thereby enabling formal methods to efficiently complete analysis and planning within a designated time window [
Gheibi et al., 2021b;
Jamshidi et al., 2019]. We present three potential aspects of integrating LLMs and diffusion models to enhance ML applications in SASs: (i) using LLMs to boost ML model performance, (ii) utilizing LLMs to improve RL, and (iii) employing LLMs or diffusion models to reduce the adaptation space.
Enhancing ML. The literature in this domain can be categorized into four types, each aiming to automate different aspects of ML: (i) ML pipeline generation: Literature such as [
Xu et al., 2024b;
Zhang et al., 2023b] focuses on automating the entire ML pipeline, from data processing to model architecture and hyperparameter tuning, enhancing overall ML performance. (ii) Data annotation: [
Ding et al., 2022] explores the performance of GPT-3 in automating data labeling. (iii) Algorithm and model selection: MLCopilot [
Zhang et al., 2024f] applies experiential reasoning to recommend effective models for new tasks by analyzing historical data on task performance, code, and accuracy. (iv) Feature engineering automation:
Tools like CAAFE [
Hollmann et al., 2023] automate feature engineering by generating context-aware features based on dataset characteristics and iteratively updating features based on performance feedback. Integrating LLMs into the ML model construction process can not only reduce the manual effort required but may also improve the model’s performance. Additionally, such LLM-based automated ML has the potential to facilitate lifelong learning and model updates at runtime [
Gheibi and Weyns, 2024;
Silver et al., 2013].
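The following sketch illustrates one CAAFE-style iteration: the LLM proposes a new feature as a pandas expression, and the feature is kept only if cross-validated performance improves. The `llm_complete` helper, the prompt, and the use of `eval` (assumed to run in a trusted sandbox) are illustrative assumptions rather than CAAFE's actual implementation.

```python
# Hedged sketch of one LLM-driven feature-engineering step with a keep-if-better check.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def feature_engineering_step(df: pd.DataFrame, target: str, llm_complete) -> pd.DataFrame:
    X, y = df.drop(columns=[target]), df[target]
    baseline = cross_val_score(RandomForestClassifier(), X, y, cv=3).mean()
    prompt = (f"Columns: {list(df.columns)}. Target: {target}. "
              "Propose one new feature as a single pandas expression over df, "
              "e.g. df['a'] / (df['b'] + 1). Answer with the expression only.")
    expr = llm_complete(prompt).strip()
    candidate = df.copy()
    candidate["llm_feature"] = eval(expr, {"df": candidate})  # assumes a trusted sandbox
    score = cross_val_score(RandomForestClassifier(),
                            candidate.drop(columns=[target]), candidate[target], cv=3).mean()
    return candidate if score > baseline else df  # keep the new feature only if it helps
```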
LLMs have been used to augment RL in the following ways: (i) Reward function: As previously discussed in requirements-driven adaptation (
Section 4.2.2), LLMs can automate the design of reward functions, demonstrating higher performance and faster convergence than expert-designed reward functions [
Kwon et al., 2023;
Sun et al., 2024d;
Xie et al., 2024;
Yu et al., 2023a]; (ii) Providing sub-goals or skills: LLMs can utilize their high-level planning abilities to guide RL agents by defining intermediate tasks. Exploring with LLMs [
Du et al., 2023], for example, encourages agents to explore strategically significant behaviors, like locating a key before attempting to open a door (a minimal sketch of this idea follows this list). Relevant studies include [
Dalal et al., 2024;
Ma et al., 2024b;
Melo, 2022;
Rocamonde et al., 2024;
Shukla et al., 2024;
Tan et al., 2024;
Zhang et al., 2023d, 2023f]. This type of study could enhance the performance of RL in scenarios that require multiple skills or long-term planning; (iii) Policy: LLMs or Transformers can decrease the expenses associated with offline RL training by directly serving as demonstration policies [
Carta et al., 2023;
Szot et al., 2024;
Wang et al., 2022]; and (iv) State representation or quality (Q-)function: Transformers can serve as state representations [
Hu et al., 2021; Lee and Moon, 2023;
Parisotto et al., 2020;
Yang et al., 2022;
Zhang et al., 2023d] or as a quality (Q-)function [
Chebotar et al., 2023;
Gallici et al., 2023] to enhance the performance, scalability, and transferability of RL.
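To make item (ii) concrete, the sketch below lets an LLM propose intermediate sub-goals from a textual state description and turns matches against these sub-goals into an exploration bonus, loosely following the “Exploring with LLMs” idea. The `llm_complete` helper and the string matching are illustrative assumptions.

```python
# Hedged sketch: LLM-suggested sub-goals shape an RL agent's exploration bonus.
def propose_subgoals(task: str, state_description: str, llm_complete) -> list[str]:
    prompt = (f"Task: {task}\nCurrent situation: {state_description}\n"
              "List up to three short, useful intermediate goals, one per line:")
    return [g.strip() for g in llm_complete(prompt).splitlines() if g.strip()]

def exploration_bonus(achieved_event: str, subgoals: list[str]) -> float:
    # Reward the agent when the description of an observed event contains one of the
    # suggested sub-goals (a deliberately crude matching heuristic for this sketch).
    return 1.0 if any(g.lower() in achieved_event.lower() for g in subgoals) else 0.0
```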
Diffusion models have also been explored for enhancing RL, serving in three different roles: (i) Data synthesizer: Diffusion models are employed to synthesize data for training due to the prevalent issue of data scarcity. Multi-Task Diffusion Model [
He et al., 2023] leverages the extensive knowledge available in multi-task datasets, performing implicit knowledge sharing among tasks, with experimental results indicating significant enhancements in generating data for unseen tasks. (ii) Policy: Diffusion-QL [
Wang et al., 2023c] innovatively employs a conditional diffusion model to express policies, integrating Q-learning guidance into the reverse diffusion chain to optimize action selection.
Kang et al. [2023] enhances the sampling efficiency of Diffusion-QL by strengthening the diffusion policy. Similarly,
Chen et al. [2023a] decouples policy learning into behavior learning and action evaluation. This approach allows for improving policy expressivity by incorporating the distributional expressivity of a diffusion-based behavior model; (iii) Planner: Diffusion models serve as planners, enhancing model-based RL by estimating action sequences that maximize cumulative rewards [
Ni et al., 2023]. Detailed methodologies are discussed in
Section 4.2.8.
Adaptation Space Reduction via LLMs. LLMs’ extensive knowledge also offers opportunities to reduce or condense the analysis and planning space of SASs semantically.
Nottingham et al. [2023] applies LLMs to hypothesize, verify, and refine an
Abstract World Model (AWM), thus abstracting the state space to enhance the training efficiency of RL agents.
Rana et al. [2023] uses semantic search in robot planning tasks involving multiple floors and rooms to prune the planning space, thus speeding up traditional planning techniques.
4.2.6 Enhancing Existing Planning Techniques.
This section explores how LLMs have the potential to enhance four existing planning methods.
Search-Based Planning. Search-based planning involves algorithms that systematically explore spaces of possible actions or configurations to identify sequences that achieve specific goals [
Harman et al., 2012]. The design of heuristics to improve the practicality and efficiency of these searches is a key focus. For instance,
Yu et al. [2023b] proposes a Graph Transformer as a heuristic function for multi-agent planning, which can be trained in environments with fewer agents and generalized to situations with more agents. Turning to LLMs,
Shah et al. [2023] utilizes “semantic guesswork” as a guiding heuristic for robot planning, such as guiding the robot to head to the kitchen for the task “find gas stove”.
Dai et al. [2024] uses LLMs to generate and translate multi-resolution (i.e., hierarchical) LTL, for example with building, floor, and room as different resolutions, within a multi-resolution multi-heuristic A* algorithm. LLAMBO [
Liu et al., 2024a] utilizes the knowledge of LLMs to enhance zero-shot warmstarting in Bayesian optimization.
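A minimal sketch of this kind of guidance is shown below: an LLM-derived semantic score replaces a hand-crafted heuristic in A*, so that nodes the LLM deems semantically closer to the goal are explored first. The graph interface, cost model, and `llm_semantic_score` helper are illustrative assumptions; the resulting heuristic is generally not admissible, so optimality is traded for guidance.

```python
# Hedged sketch: A* search where the heuristic comes from an LLM's semantic judgement.
import heapq
import itertools

def a_star(start, goal, neighbours, step_cost, llm_semantic_score):
    """neighbours(n) yields successor nodes; llm_semantic_score(n, goal) returns a value in [0, 1]."""
    counter = itertools.count()  # tie-breaker so the heap never compares nodes directly
    frontier = [(0.0, next(counter), start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt in neighbours(node):
            g = best_g[node] + step_cost(node, nxt)
            if g < best_g.get(nxt, float("inf")):
                best_g[nxt] = g
                h = 1.0 - llm_semantic_score(nxt, goal)  # higher score => explored earlier
                heapq.heappush(frontier, (g + h, next(counter), nxt, path + [nxt]))
    return None
```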
Evolutionary Algorithms (EAs). Although EAs are a form of search method, they are discussed separately here due to their distinct characteristics and widespread application. EAs, inspired by natural evolution and genetics, are known for their global search capabilities and adaptability to various problem types [
Li et al., 2024a; McDonnell et al., 2023]. Enhancements via LLMs in EAs focus on search operators like LLM-based crossover, mutation, and selection [
Cai et al., 2024b]. A representative example, [
Liu et al., 2024b], demonstrates how LLMs can first select parent solutions from the current population, and then facilitate crossover and mutation processes to generate offspring solutions. The experiments indicate competitive performance on small-scale, single-objective problems like the traveling salesman problem with 20 nodes. Similarly,
Guo et al. [2024c] employs LLMs as evolutionary search operators to automatically generate optimization algorithms for the traveling salesman problem, showing that LLM-generated heuristic algorithms surpass traditional greedy heuristics. Yang and Li [2023a] proposes a decomposition-based multi-objective EA framework, using LLMs to manage the reproduction of individuals within decomposed subproblems.
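The sketch below shows the basic shape of such an LLM-driven evolutionary loop: the LLM acts as a combined crossover-and-mutation operator over textual solution encodings, while selection here is simple truncation. The `llm_complete` helper, the prompt, and the fitness callback are illustrative assumptions.

```python
# Hedged sketch: an evolutionary loop whose variation operator is an LLM.
import random

def evolve(population: list[str], fitness, llm_complete, generations: int = 10) -> str:
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:2]  # truncation selection
        prompt = ("You are an evolutionary operator. Combine and slightly mutate the two "
                  "parent solutions below into one new candidate.\n"
                  f"Parent A: {parents[0]}\nParent B: {parents[1]}\nChild:")
        child = llm_complete(prompt).strip()
        population[random.randrange(len(population))] = child  # replace a random individual
    return max(population, key=fitness)
```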
Game Theory. Game theory provides a mathematical framework to analyze strategic interactions among rational decision-makers and is extensively applied in adversarial settings, such as security [
Chan et al., 2024b;
Li et al., 2024b]. Leveraging the natural language understanding capabilities of LLMs, game theory can now be “realized” directly through natural language instead of mathematical definitions, broadening its application scope to areas like social simulation.
Fan et al. [2024a] conducted a systematic analysis of LLMs’ rationality in game theory, assessing their performance in three classical games focused on (a) clear desire, (b) belief refinement, and (c) optimal actions. The study highlighted that even advanced models like GPT-4 require enhancements in these areas. Furthermore, developments in game theory benchmarks and platforms have been made to better evaluate LLMs’ game-playing capabilities. Challenges remain, as [
Fan et al., 2024a] pointed out, particularly in strengthening the rationality of LLMs in game-theoretic settings. Enhancing LLMs’ performance through targeted prompt engineering, such as incorporating explicit desire and belief information, could significantly improve their rationality. Additionally, while traditional game theory still relies on mathematical definitions, the efficacy of LLMs within this conventional framework has yet to be fully ascertained.
Swarm Algorithm. Inspired by biological phenomena such as ant colonies and fish schooling, swarm intelligence focuses on the collective behavior of decentralized, self-organized systems and has recently seen renewed interest by the research community [
Bozhinoski, 2024]. The integration of LLMs into swarm intelligence is still nascent, with [
Pluhacek et al., 2023] being the only study we found in our review. This research explores the automation of hybrid swarm intelligence optimization algorithms using LLMs, tackling the challenge posed by the exponential growth in the number of hybrid (swarm) algorithms due to the diversity of base (swarm) algorithms.
4.2.7 Language Model as Planner.
Given the above background, LLMs’ reasoning capabilities and broad knowledge further position them as potentially powerful, generalized planners. We outline four unique paradigms in LLM-based planning:
Transformer as Planner. Prior to the adoption of LLMs for planning, several studies already conceptualized planning as a sequence modeling problem, thereby allowing the use of Transformers as planners.
Decision Transformer (DT) [
Chen et al., 2021a] is a foundational work in this area. It aligns with RL and trains a Transformer to output optimal actions based on expected returns (rewards), past states, and actions, achieving performance that surpassed the then state-of-the-art model-free offline RL methods. From this foundation, many improvements have been derived: Online DT [
Zheng et al., 2022] further combines offline pre-training with online fine-tuning; Weighting Online DT [
Ma and Li, 2024] introduces an episodic memory mechanism to enhance sample efficiency during online fine-tuning; and Multi-Game DT is trained on large, diverse datasets, enabling near-human performance in up to 46 Atari games. Generalized DT [
Furuta et al., 2022] addresses a wide range of “hindsight information-matching problems,” such as imitation learning and state-marginal matching. Hyper-DT [
Xu et al., 2023] incorporates an adaptation module into DT, which uses a hyper-network to initialize its parameters based on task demonstrations, effectively adapting to new tasks. Constrained DT [
Liu et al., 2023b] achieves dynamic adjustments between safety and performance during deployment. Q-learning DT [
Yamagata et al., 2023] enhances DT performance when only sub-optimal trajectories are included in the dataset by using dynamic programming (Q-learning) to label training data.
Zhu et al. [2023c] decomposes long-delayed rewards over individual timesteps, formulating the decomposition as a globally optimal bi-level optimization problem, thereby enhancing the performance of DT in settings with delayed rewards. It is important to note that these studies can also be viewed as a new realization of RL, where Transformer pre-training is employed to replace traditional methods of fitting value functions or computing policy gradients.
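The conditioning mechanism behind DT can be summarized in a short rollout sketch: the model is fed the history of (return-to-go, state, action) tokens and queried for the next action, while the desired return is decremented by the rewards obtained so far. The `dt_model` object and the environment interface are hypothetical; this is not the original implementation.

```python
# Hedged sketch of Decision-Transformer-style action selection as sequence modelling.
def decision_transformer_rollout(env, dt_model, target_return: float, horizon: int):
    returns_to_go, states, actions = [target_return], [env.reset()], []
    for _ in range(horizon):
        # The model sees the whole history and predicts the action for the latest timestep.
        action = dt_model.predict_action(returns_to_go, states, actions)
        state, reward, done = env.step(action)
        actions.append(action)
        states.append(state)
        # Conditioning trick: reduce the desired return by the reward just obtained.
        returns_to_go.append(returns_to_go[-1] - reward)
        if done:
            break
    return actions
```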
Additionally, Transformers have been utilized as planners in the following applications.
Yang et al. [2023a] trains a Recurrent Transformer to enable logical reasoning on constraint satisfaction problems. Takagi [2022] explores the impact of different modalities on Transformer performance, investigating why models pre-trained on image data perform poorly. TIMAT [
Kang et al., 2024] extracts temporal information and models
Multi-Agent RL (MARL) as a sequence model; its advantage is the ability to plan for an arbitrary number of agents. MetaMorph [
Gupta et al., 2022] trained Universal Controllers for exponentially morphable modular robots, demonstrating the Transformer’s combinatorial generalization capabilities.
Collective Intelligence. Collective intelligence, also referred to as crowdsourcing or self-collaboration in some literature, utilizes the wisdom of crowds to achieve consensus-driven decision-making through discussion, debate, or voting [
Ferreira et al., 2024]. Here, multiple agents or roles are often enabled by various fine-tuned LLMs or prompted by different contexts.
Zhang et al. [2023c] integrates the Actor-Critic concept from RL into LLM multi-agent crowdsourcing, highlighting its potential to cut hallucinations and reduce token usage costs. RoCo [
Mandi et al., 2024] promotes information exchange and task reasoning among robots in multi-robot planning by facilitating discussions.
Shi et al. [2024b] offers a concept that is similar to the MAPE loop, involving three agents working together to complete tasks: (i) observing to collect environmental data, (ii) decomposing instructions for planning, and (iii) using skills to execute tasks.
Chen et al. [2024e] explores automated expert recruitment (deciding what kind of domain expert is needed for the task and then generating their persona) and various forms of crowdsourcing (democratic or hierarchical).
Guo et al. [2024b] evaluates the impact of designated leadership in LLM-agent organizations, demonstrating several interesting results: (a) in small teams, higher efficiency can be achieved with less communication cost; (b) agents can elect their own leader and dynamically adjust leadership via communication; and (c) agents spontaneously engage in activities that mimic human behaviors, such as reporting task progress to the leader agent. This study also introduces a criticize-reflect framework to evaluate and adjust organizational structures. Dong [2024] explores the high costs and negative impacts of misinformation in large-scale democratic discussions. This paradigm offers new decision-making avenues, which may be particularly suitable for decentralized SASs [
Weyns et al., 2013].
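A minimal sketch of such consensus-driven decision-making is given below: several LLM “personas” each vote for one adaptation option and the majority wins. The `llm_complete` helper, the personas, and the fallback on unparseable answers are illustrative assumptions; real frameworks add discussion rounds, critiques, or elected leaders.

```python
# Hedged sketch: majority voting among differently prompted LLM roles.
from collections import Counter

def crowd_decide(problem: str, options: list[str], personas: list[str], llm_complete) -> str:
    votes = []
    for persona in personas:
        prompt = (f"You are {persona}.\nProblem: {problem}\n"
                  f"Options: {options}\nAnswer with exactly one option from the list:")
        answer = llm_complete(prompt).strip()
        votes.append(answer if answer in options else options[0])  # fallback on parse failure
    return Counter(votes).most_common(1)[0][0]

# e.g. crowd_decide("latency above threshold",
#                   ["scale out", "degrade video quality"],
#                   ["a performance engineer", "a cost analyst", "an end-user advocate"],
#                   llm_complete)
```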
Experience Accumulation. Experience accumulation, also called lifelong learning in some studies [
Silver et al., 2013], enables agents to use LLMs to gather experience from both failures and successes, learning to improve future planning.
For failed experiences, LLMs or human analysts can identify the causes of failures, reflect on these insights, and integrate them into future planning cycles. This approach is also known in some studies as planning with feedback or self-reflection.
Madaan et al. [2022] records instances of LLM misunderstandings along with user feedback, enhancing prompt accuracy for future queries by integrating past clarifications.
Li et al. [2022a] refers to this as an “active data collection process,” iterating strategies through interactions with the environment based on past failed experiences.
Huang et al. [2022] refers to this process as “inner monologue”.
Wang et al. [2023a] introduces the Describe, Explain, Plan, and Select framework, where an LLM describes the plan execution process and provides self-explanations upon encountering failures, facilitating effective error correction.
Zhang et al. [2024c] propose the Prompt Ensemble learning via Feedback-Reflect-Refine method, which uses a feedback mechanism to reflect on planning inadequacies and generates new prompts for iterative refinement.
Yang et al. [2024c] treats LLMs as optimizers to solve optimization problems described in natural language, where previously generated solutions and their outcomes are used to prompt the LLM to generate new solutions.
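The common loop behind these planning-with-feedback approaches can be sketched as follows: when a plan fails, the LLM is asked to distill the failure into a lesson, which is prepended to the next planning prompt. The `llm_complete` and `execute_plan` helpers are hypothetical placeholders.

```python
# Hedged sketch of a reflect-and-replan loop driven by execution feedback.
def plan_with_reflection(task: str, llm_complete, execute_plan, max_attempts: int = 3):
    reflections = []
    for _ in range(max_attempts):
        lessons = "\n".join(reflections)
        prompt = (f"Task: {task}\n"
                  + (f"Lessons from earlier failures:\n{lessons}\n" if lessons else "")
                  + "Produce a step-by-step plan:")
        plan = llm_complete(prompt)
        ok, error_trace = execute_plan(plan)  # returns (success flag, failure description)
        if ok:
            return plan
        # Reflection: turn the raw failure into a reusable lesson for the next planning cycle.
        reflections.append(llm_complete(
            f"The plan below failed with: {error_trace}\nPlan: {plan}\n"
            "State in one sentence what to do differently next time:"))
    return None
```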
For successful experiences, LLMs store these in memory or a skill pool for later retrieval and reuse in similar scenarios.
Zhu et al. [2023a] introduces a three-step process for LLM-based memory reuse: (a) storing the executed plan once the goal of a game scenario is achieved; (b) summarizing common reference plans from multiple scenarios for more generalized situations; and (c) creating new plans based on these reference plans when similar goals arise. Over time, as these summaries accumulate, the effectiveness of the LLM-based planner increases. Similarly,
Zhao et al. [2024] propose ExpeL (Experiential Learning), which enhances task success rates through experience gathering and insight extraction.
LLMs As Tool Makers (LATM) [
Cai et al., 2024a] approaches from a tool maker’s perspective, enabling LLMs to create and utilize tools, which are implemented as Python functions. Moreover, LATM attempts to utilize different LLMs to create tools of varying complexity, thereby reducing the cost of tool production. AdaPlanner [
Sun et al., 2023b] introduces skill filtering, which involves comparing the performance of including versus not including past successful experiences in prompts to determine the generalizability of these experiences.
Optimizing Prompting for Black-Box LLMs. Prompt engineering is crucial in maximizing the planning capabilities of LLMs as it directly impacts the model’s understanding and response to tasks [
Sahoo et al., 2024]. However, LLMs often operate as a black box to users, particularly in the context of LLM as a service (e.g., accessing LLMs through an API). Beyond the previously discussed prompt patterns such as CoT, self-consistency, ToTs, and GoT, recent studies have treated prompt design as an optimization problem to enhance the LLM’s planning performance. These studies can be categorized into four types: (i) RL-based optimization: TEMPERA [
Zhang et al., 2023e] treats prompt optimization as an RL challenge, where the action space includes editing instructions, in-context examples, and verbalizers. The rewards are gauged by the performance improvements from these edits. Similarly, RLPrompt [
Deng et al., 2022] trains a policy network to generate effective prompts, noting that optimized prompts sometimes appear as “gibberish” that defies standard grammatical conventions. Additionally, Prompt-OIRL [
Sun et al., 2024b] leverages an expert dataset and inverse RL to derive a reward model that facilitates prompt evaluations; (ii)
Evolutionary Algorithm (EA)-based optimization: Employing EAs for gradient-free prompt optimization, several methodologies have emerged. Gradient-free Instructional Prompt Search [
Prasad et al., 2023], Genetic Prompt Search [
Xu et al., 2022a], and EvoPrompt [
Guo et al., 2024c] utilize the robust optimization capabilities of EAs. InstOptima [
Yang and Li, 2023a] extends this approach by considering multi-objective goals, evaluating both performance and additional metrics like instruction length; (iii) Incorporating classic planning ideas into prompt: Classic planning principles have also been integrated into prompt engineering. PromptAgent [
Wang et al., 2024a] treats the design space of prompts as a planning problem and uses Monte Carlo Tree Search to strategically explore high-quality prompts, where experiences of failure during interaction with the environment are used to define the rewards in the search.
Hazra et al. [2024] introduces the SayCanPay framework, where (a) an LLM generates candidate actions based on a goal and initial observation (“Say”), (b) an affordance model evaluates the feasibility of these actions (“Can”), and (c) the most feasible and cost-effective plan is selected using a combined score as a heuristic (“Pay”). Here, Can and Pay are independent models that require domain-specific training to ensure the alignment of plans with the current environment. Furthermore, combining hybrid planning (“fast and slow”) [
Pandey et al., 2016] and hierarchical planning,
Lin et al. [2023a] and
Liu et al. [2024f] employ a dual-LLM framework in which a reasoning-focused LLM (“slow mind”) handles detailed planning or the interpretation of teammates’ intentions, while a lightweight LLM (“fast mind”) generates reactive policies and macro actions; and (iv) Self-adaptive prompting: Self-adaptive prompting is an approach tailored to zero-shot learning, designed to automatically optimize prompt design. The idea is to first use the LLM to generate several candidate pseudo-demonstrations in a zero-shot setting and then select the most effective ones for ICL based on metrics such as consistency and logit entropy. Key studies include
Consistency-based Self-adaptive Prompting (COSP) [
Wan et al., 2023a] and
Universal Self-adaptive Prompting (USP) [
Wan et al., 2023b]. Experimental results indicate that COSP enhances performance by an average of 15% over the zero-shot baseline, and both COSP and USP have demonstrated comparable or even superior performance to few-shot baselines in certain tasks.
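The core selection step of consistency-based self-adaptive prompting can be sketched as follows: the LLM is sampled several times zero-shot, answers that agree with the majority are kept as pseudo-demonstrations, and these are prepended when answering new questions. The `llm_sample` helper and the use of simple majority agreement (rather than COSP's full consistency and entropy scoring) are illustrative simplifications.

```python
# Hedged sketch of consistency-based pseudo-demonstration selection for zero-shot ICL.
from collections import Counter

def build_pseudo_demos(question: str, llm_sample, n_samples: int = 8, k_demos: int = 2):
    answers = [llm_sample(f"Q: {question}\nA:").strip() for _ in range(n_samples)]
    majority, _ = Counter(answers).most_common(1)[0]
    # Keep up to k_demos samples that agree with the majority answer as demonstrations.
    return [f"Q: {question}\nA: {a}" for a in answers if a == majority][:k_demos]

def answer_with_demos(new_question: str, demos: list[str], llm_sample) -> str:
    prompt = "\n\n".join(demos + [f"Q: {new_question}\nA:"])
    return llm_sample(prompt).strip()
```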
4.2.8 Diffusion Model as Planner.
Diffusion models have recently been applied for use in planning tasks.
Janner et al. [2022] pioneered this approach by reinterpreting diffusion-based image inpainting as a method for generating coherent plans, showing the model’s capability in long-horizon decision-making and its adaptability to unseen environments in 2D maze experiments. Subsequently, diffusion models have been extensively applied in motion planning for robotic arms [
Mishra and Chen, 2023;
Pearce et al., 2023;
Ze et al., 2024] and quadruped robots [
Liu et al., 2024c], as well as continuous constraint solvers [
Yang et al., 2023c].
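The inpainting analogy can be sketched in a few lines: a whole trajectory is denoised from random noise while its first and last states are clamped to the current state and the goal, so the reverse-diffusion process fills in a plan between them. The `denoise_model` network, the schedule, and the array shapes are illustrative assumptions, not the implementation of the cited works.

```python
# Hedged sketch of diffusion-based planning via inpainting-style conditioning.
import numpy as np

def diffusion_plan(start_state, goal_state, denoise_model, horizon=32, state_dim=4, steps=50):
    trajectory = np.random.randn(horizon, state_dim)  # start from pure noise
    for t in reversed(range(steps)):
        # Inpainting-style conditioning: keep the known endpoints fixed at every step.
        trajectory[0], trajectory[-1] = start_state, goal_state
        trajectory = denoise_model(trajectory, t)     # one reverse-diffusion step
    trajectory[0], trajectory[-1] = start_state, goal_state
    return trajectory  # intermediate rows are the planned states to track
```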
Additionally, further developments have been made in enhancing different aspects of diffusion models. For enhancing long-range decision-making capabilities, Generative Skill Chaining [
Mishra et al., 2023] introduces a method where individual skills are modeled as separate diffusion models and sequentially chained to address long-horizon goals. This chaining process involves generating post-condition states of one skill that satisfy the pre-conditions of the subsequent skill. Regarding uncertainty-aware planning, Dynamics-informed Diffusion [
Cachay et al., 2023] couples probabilistic temporal dynamics forecasting with the diffusion steps, and PlanCP [
Sun et al., 2023a] quantifies the uncertainty of diffusion dynamics models using Conformal Prediction and modifies the loss function for model training.
Chen et al. [2024b] introduces a hierarchical diffuser strategy that employs a “jumpy” high-level planning technique with a broader receptive field and reduced computational demands, effectively directing the lower-level diffuser through strategic sub-goals. Similarly,
Li et al. [2023d] proposes a hierarchical diffusion method, which includes a reward-conditional goal diffuser for subgoal discovery and a goal-conditional trajectory diffuser for generating the corresponding action sequence of subgoals.
Zhou et al. [2023a] focuses on online replanning, where the timing of replanning is determined based on the diffusion model’s estimated likelihood of existing generated plans, and the replanning is based on existing trajectories to ensure that new plans follow the same goal state as the original trajectory.
Jin et al. [2023] introduces a hierarchical semantic graph for fine-grained control of generation, including overall movement, local actions, and action details, to improve the granularity of generated controls.