
Survey Report: Frontier Progress in Large Language Models—From Foundational Training to Multimodal Interaction


Introduction

In recent years, Large Language Models (LLMs) have achieved breakthrough progress, becoming a core driving force in natural language processing and the broader field of artificial intelligence. They have demonstrated unprecedented capabilities in understanding, generation, reasoning, and interaction with humans, bringing transformative impacts to numerous application scenarios. However, the development of LLMs still faces many key challenges, including how to continuously enhance the core capabilities of models, how to ensure model behavior aligns with human intent and values, how to effectively integrate and process multimodal information such as images, audio, and video, and how to build more natural and efficient human-computer interaction methods.

This survey aims to systematically review and summarize the latest progress and core technical methods of LLMs in frontier directions such as foundational model construction and open-sourcing, instruction following and alignment techniques, and the evolution towards multimodal understanding and real-time natural interaction. It does so through an in-depth analysis of seven representative research papers spanning different LLM development directions. By studying the research background, design ideas, and key achievements of these works, and by analyzing them critically, this survey hopes to provide a valuable reference for understanding the current state of development, technical bottlenecks, and future trends of LLMs.

Table 1: Overview of Core Papers in This Survey

| Paper Title | Topic Category | arXiv ID | Core Contribution/Research Point |
|---|---|---|---|
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Open-Source LLM Training Pipeline | 2307.09288 | Released a series of open-source, high-performance pretrained and fine-tuned chat models (7B-70B), and detailed their training, alignment, and safety methods. 1 |
| Finetuned Language Models Are Zero-Shot Learners | SFT | 2109.01652 | Proposed instruction tuning, demonstrating its significant improvement in the zero-shot learning capabilities of language models on unseen tasks. 2 |
| Training language models to follow instructions with human feedback | RLHF | 2203.02155 | Detailed the method of training language models to follow instructions using Reinforcement Learning from Human Feedback (RLHF), making their outputs more aligned with human preferences and more "helpful, honest, and harmless." 5 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | LLM → VLLM | 2304.10592 | Explored aligning a frozen visual encoder with an advanced frozen LLM via a simple projection layer to achieve advanced multimodal understanding and generation capabilities similar to GPT-4. 6 |
| Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents | Time Synchronization | 2409.15594 | Aimed to break the traditional turn-based dialogue model by integrating time information into LLMs, synchronizing them with a real-world clock to enable full-duplex voice dialogue supporting user barge-in and overlapping speech. 10 |
| Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Audio Dialogue Model | 2408.16725 | Proposed an end-to-end audio dialogue model capable of real-time voice interaction, achieving "hearing, talking while thinking" in streaming, while aiming to preserve the powerful capabilities of the original language model and reduce reliance on external TTS systems. 13 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Video Understanding Model | 2306.02858 | Focused on enabling LLMs to simultaneously understand visual dynamics and sound events in videos, integrating audio-visual signals through a multi-branch architecture and techniques like ImageBind, and performing instruction tuning for conversational understanding of video content. 16 |

I. Research Background

1.1 Cornerstones of LLMs: Pretraining, Fine-tuning, and Alignment

Large-scale pretrained language models have become a significant driving force in the field of artificial intelligence. These models, through self-supervised learning on massive text data, have acquired powerful language understanding and generation capabilities, demonstrating immense potential in complex reasoning, professional knowledge Q&A, code generation, and even creative writing.1 However, raw models that have only undergone pretraining, while knowledgeable, often have behavioral patterns that are not directly applicable to specific downstream tasks. More importantly, they may not align well with users' specific intents and expectations, and may even produce untruthful, biased, or harmful outputs.5 To address these issues, a series of fine-tuning and alignment techniques have emerged, forming the crucial link in transforming pretrained LLMs into practical AI assistants.

Core concepts and their evolution include:

  • Pretraining: This is the foundation of LLM capabilities. Models (usually based on the Transformer architecture) are trained on unlabeled text data containing trillions of tokens, typically with the objective of predicting the next token in a text sequence (autoregressive language modeling) or filling in masked parts of text. Through this process, models learn rich syntactic, semantic, and pragmatic knowledge, and even a degree of world knowledge. For example, the pretraining of the Llama 2 model employed an optimized autoregressive Transformer architecture, used up to 2 trillion tokens of publicly available data, and introduced techniques like Grouped-Query Attention (GQA) to improve efficiency and performance.1
  • Supervised Fine-Tuning (SFT): After pretraining, models usually undergo an SFT phase. In this stage, the model is fine-tuned on a relatively small but high-quality dataset consisting of "instruction-expected answer" pairs. By learning these samples, the model initially acquires the ability to understand and follow human instructions. The Llama 2 study particularly emphasized the importance of SFT data quality, noting that tens of thousands of high-quality SFT annotation samples can achieve good results, far surpassing the use of millions of low-quality third-party examples.1 InstructGPT also used SFT as the first step in its three-stage alignment process to teach the model basic instruction-following behavior.5
  • Instruction Tuning: This is a special form of SFT, the core idea of which is to improve the model's generalization ability on previously unseen tasks, especially zero-shot learning ability, by fine-tuning on a large number of instruction datasets covering different task types. The core viewpoint of the FLAN (Finetuned Language Models Are Zero-Shot Learners) paper 2 is precisely that "instruction tuning makes language models zero-shot learners." By performing instruction tuning on a collection of over 60 NLP tasks (all described via natural language instructions), FLAN significantly improved the model's zero-shot performance on unseen tasks.
  • Zero-Shot Learning: This refers to the ability of a model to perform a task based solely on its natural language description (i.e., instruction), without having seen any training samples for that specific task. For example, given the instruction "Translate this sentence from English to French: Hello world.", a model with good zero-shot learning ability should be able to directly provide the correct French translation, even if it was not specifically trained for English-French translation during fine-tuning. FLAN's research showed that instruction tuning is a key way to enhance this zero-shot capability.2
  • Reinforcement Learning from Human Feedback (RLHF): This is a more advanced alignment technique aimed at further optimizing model behavior through human preference data, making its output more consistent with human expectations. RLHF typically includes the following steps: first, collecting human preference rankings for different model outputs generated for the same input (e.g., which answer is better); second, using this preference data to train a Reward Model (RM) that learns to predict the degree of human preference for model outputs; finally, using this reward model as a reward signal in a reinforcement learning environment, employing algorithms such as Proximal Policy Optimization (PPO) to fine-tune the language model itself, so that its generated outputs can achieve higher reward scores, thereby more closely aligning with human preferences. Llama 2 1 and InstructGPT 5 both detailed the core role of RLHF in their model alignment processes, including preference data collection strategies, reward model design and training, and iterative fine-tuning in the reinforcement learning phase.
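
As a minimal illustration of the reward-modeling step described above, the sketch below turns pairwise human preference labels into a training loss for a scalar reward model. The function and tensor names are illustrative assumptions, not code from any of the surveyed papers.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a scalar reward model: reward_chosen and
    reward_rejected are the model's scores for the preferred and rejected
    responses to the same prompt. Minimizing this loss pushes the preferred
    response's score above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up scores for a batch of three comparisons.
chosen = torch.tensor([1.2, 0.3, -0.5])
rejected = torch.tensor([0.7, 0.9, -1.1])
print(preference_loss(chosen, rejected))
```

The reinforcement learning stage then treats the trained reward model's score as the reward signal for fine-tuning the policy, as described above.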

The evolution of these techniques has led to a series of key research questions, such as:

  • How can we build and open up powerful foundational language models while ensuring their safe use, thereby benefiting a broader research and application community? (Llama 2 1)
  • How exactly does instruction tuning improve a model's zero-shot generalization ability? What are the key factors for its success (e.g., number of tasks, model scale, instruction format)? (FLAN 2)
  • How can human feedback, an indirect signal that nonetheless reflects human judgment more faithfully than standard training objectives, be used to align language models with users' complex intents so that they behave as more "helpful, honest, and harmless" assistants in practical applications? (InstructGPT 5; Llama 2 1)

Examining these foundational works, an important trend emerges: alignment techniques are the key step to unlocking the potential of LLMs and moving them from theory to practice. However, high-quality, diverse alignment data is the core bottleneck in this process. Original LLMs, such as the early GPT series or Llama 1, although possessing vast knowledge after pretraining, often produce outputs that do not meet human expectations, potentially filled with repetition, bias, or even harmful content.1 The application of alignment techniques like SFT and RLHF, as demonstrated in the InstructGPT and Llama 2-Chat models, can significantly improve the usefulness, truthfulness, and safety of models.1 FLAN's research also clearly shows that through instruction tuning, models can better generalize to unseen tasks, greatly enhancing their zero-shot capabilities.3 These alignment techniques (SFT, RLHF, instruction tuning) form the bridge connecting powerful pretrained models with practical AI assistants. However, the effectiveness of these techniques is highly dependent on the alignment data used. The Llama 2 study particularly emphasized the principle of "quality over quantity" in the SFT phase, finding that a small amount of high-quality human-annotated data is far more effective than a large amount of low-quality data.1 Similarly, InstructGPT's success was built on high-quality human-annotated demonstration data and preference data.5 This means that future competition in the LLM field, beyond model scale and pretraining data volume, will more critically depend on the ability to acquire and effectively utilize high-quality, diverse alignment data. This not only places higher demands on the data annotation industry but may also spawn new research directions, such as how to generate or filter high-quality alignment data at lower cost and higher efficiency, or how to design alignment algorithms that are less dependent on data.

1.2 Expansion of LLMs: Towards Multimodal Understanding and Interaction

Information in the real world is inherently multimodal, encompassing text, images, audio, video, and other forms. LLMs that rely solely on text for interaction have significantly limited capabilities when it comes to understanding complex real-world scenarios and completing tasks that require integrating multiple types of information.16 Therefore, endowing LLMs with the ability to understand, process, and even generate information in multiple modalities has become a crucial research hotspot in the current field of artificial intelligence.6 This trend is driving the evolution of LLMs from pure text processing towards richer multimodal intelligence.

Core concepts and their evolution include:

  • Vision-Language Models (VLM / Vision Large Language Models, VLLM): These models aim to combine visual information (primarily static images) with powerful language models, enabling the model to "see" and understand image content, and to generate text based on image content (e.g., image description), answer questions (Visual Question Answering, VQA), or perform more complex reasoning. The research goal of MiniGPT-4 6 was precisely to explore whether aligning a frozen visual encoder (like ViT) with an advanced frozen LLM (like Vicuna) could exhibit advanced multimodal capabilities similar to GPT-4, such as generating website code from hand-drawn sketches or identifying humorous elements in images.
  • Audio-Visual Language Models: Building on VLMs, these models further integrate auditory information (audio) into LLMs, enabling them to understand complex multimodal data like videos, which contain dynamic visual scenes and synchronous sound events. The work of Video-LLaMA 16 aimed to address two core challenges: first, how to effectively capture changes in visual scenes over time, and second, how to effectively integrate audio-visual signals and combine them with the LLM's understanding capabilities.
  • Streaming and Real-time Interaction: These concepts focus on how models process information and generate responses. Traditional LLMs usually need to receive the complete user input before starting to process and generate a reply, which can cause noticeable delays in dialogue scenarios, affecting the naturalness of interaction. Streaming and real-time interaction, on the other hand, pursue the model's ability to think and gradually generate responses while continuously receiving input information, much like humans, thereby achieving smoother, more immediate interaction (a toy sketch of this streaming pattern follows this list). The research of Mini-Omni 13 particularly emphasized the model's ability to directly process the audio modality and perform reasoning with streaming output, aiming to overcome the high latency caused by traditional reliance on external Text-to-Speech (TTS) systems and achieve a "hear, talk while thinking" interaction mode.
  • Full-Duplex Dialogue: This is the pursuit of a more natural human-computer dialogue mode. In traditional half-duplex or turn-based dialogue systems, one party must finish speaking before the other can begin. Human dialogue, however, is typically full-duplex, allowing both parties to speak and listen simultaneously, thus enabling richer interaction dynamics such as immediate feedback (e.g., "uh-huh," "yes"), interrupting the other party for clarification or supplementation, and overlapping speech.12 The core goal of the paper "Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents" 10 is to enable LLMs to achieve full-duplex dialogue synchronized with a real-world clock, thereby simulating more natural human communication.
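
To make the streaming idea above concrete, here is a toy sketch of a response loop that begins emitting partial output while input chunks are still arriving; the chunked-string interface and the heuristic for when to start replying are illustrative assumptions, not the design of Mini-Omni or any other surveyed system.

```python
from typing import Iterable, Iterator

def streaming_reply(audio_chunks: Iterable[str]) -> Iterator[str]:
    """Consume input chunks as they arrive and start yielding partial
    replies before the input stream has ended. A real system would run
    incremental speech recognition and LLM decoding in place of the
    placeholder strings used here."""
    heard: list[str] = []
    for chunk in audio_chunks:
        heard.append(chunk)                      # keep listening
        if len(heard) >= 2:                      # enough context to start responding
            yield f"[partial reply after {len(heard)} chunks]"
    yield "[final reply to: " + " ".join(heard) + "]"

for piece in streaming_reply(["how", "is", "the", "weather", "today"]):
    print(piece)
```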

These expansions towards multimodality and more natural interaction have led to a series of new research questions:

  • How can pretrained, powerful visual encoders be efficiently and effectively aligned with advanced language models to achieve or replicate impressive advanced multimodal capabilities like those of GPT-4 at a low cost? (MiniGPT-4 6)
  • How can model architectures and training strategies be designed to enable LLMs to simultaneously understand time-varying visual content and synchronous audio events in videos, and to interact meaningfully based on this comprehensive understanding? (Video-LLaMA 16)
  • How can the limitations of traditional dialogue systems' turn-taking be overcome to enable LLMs to achieve full-duplex voice dialogue synchronized with a real physical clock, supporting natural user interruptions, thereby significantly improving interaction naturalness and efficiency? (Synchronous LLMs 10)
  • How can an end-to-end audio dialogue model be built that can, like a human, "hear, talk while thinking" in real-time, while maintaining and utilizing the powerful understanding and reasoning capabilities of its underlying language model during fluent voice interaction? (Mini-Omni 13)

Observing the development of these multimodal LLMs, a clear evolutionary path can be identified: from an initial focus on "understanding" multimodal information, gradually developing towards the pursuit of more complex "interaction," and in this process, increasingly emphasizing the "real-time" and "naturalness" of interaction. Early vision-language models primarily focused on tasks biased towards "understanding," such as image description and visual question answering. MiniGPT-4 6 and Video-LLaMA 16, while still having strong understanding components (e.g., generating detailed image descriptions or answering questions about video content), have begun to exhibit stronger interactive capabilities, such as generating stories or poems inspired by images, or guiding users in cooking based on food photos. Subsequent works like Synchronous LLMs 10 and Mini-Omni 13 more explicitly set their goals on improving the real-time nature of dialogue (through streaming processing) and naturalness (through features like full-duplex interaction, hearing, talking while thinking). The driving force behind this evolution stems from users' deep desire for AI systems to communicate as fluently, naturally, and efficiently as humans do with each other. This not only places higher demands on model architecture design (e.g., end-to-end models, streaming processing modules, time synchronization mechanisms) but also brings new challenges to the type and quality of training data (e.g., requiring dual-channel dialogue data that includes real interruptions and overlapping phenomena, as discussed in 19 and 11) and model evaluation methods (e.g., how to evaluate the natural fluency of dialogue and the appropriateness of interruption handling, as explored in the discussion of full-duplex capability benchmarks in 20). This suggests that future research in multimodal LLMs may focus more on enhancing the model's "interactive intelligence" rather than just its "perceptual intelligence."

II. Analysis of Core Paper Research Designs

This chapter will delve into the research designs of the seven core papers selected for this survey, including their overall ideas, model architectures, data processing, training methods, and key technical innovations.

2.1 Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv: 2307.09288)

Llama 2 1 is a series of large language models developed and released by Meta AI, with parameter sizes ranging from 7 billion (7B) to 70 billion (70B). This series not only includes pretrained foundational models (Llama 2) but, more importantly, introduces fine-tuned models Llama 2-Chat, optimized for dialogue scenarios. The core contribution of this work lies in the openness of its models and the detailed exposition of its training and alignment methodologies.

  • Pretraining:
    Llama 2's pretraining builds upon Llama 1 with several improvements.
    • Data: A new, publicly available data mix was used for training, with the total number of tokens increased by 40% compared to Llama 1, reaching 2 trillion tokens. The model's context length was also doubled from Llama 1's 2048 tokens to 4096 tokens. To enhance the model's knowledge base and reduce hallucinations (i.e., fabricating facts), the research team up-sampled factual content in the data sources.1
    • Model Architecture: Llama 2 employs an optimized standard Transformer architecture. Specific techniques include using RMSNorm for pre-normalization to stabilize training, SwiGLU activation function for performance improvement, and rotary position embeddings (RoPE) to handle positional information in sequences. Compared to Llama 1, besides the increased context length, a major architectural improvement was the adoption of Grouped-Query Attention (GQA) in larger models (e.g., 34B and 70B). GQA significantly reduces memory footprint and computation during inference by allowing multiple query heads to share the same set of key and value heads in the multi-head attention mechanism, thereby improving the scalability of larger models while maintaining performance.1
  • Supervised Fine-Tuning (SFT):
    The goal of the SFT stage is to enable the pretrained model to initially understand and follow instructions.
    • Data: The Llama 2 research team particularly emphasized the importance of SFT data quality. They found that focusing on collecting a smaller set of high-quality, human-written or carefully screened SFT annotation samples (ultimately about 27,540) led to more significant model performance improvements than using millions of third-party SFT samples of varying quality. This indicates that in the SFT stage, "quality" is far more crucial than "quantity".1
    • Method: During SFT, the model is trained using a standard autoregressive objective, i.e., predicting the next token in the sequence. A key detail is that the loss is calculated and backpropagated only on the tokens of the answer part, while the tokens of the user input instruction are excluded from the loss (a minimal sketch of this loss masking appears after this list). This ensures the model focuses on learning how to generate good answers. A cosine learning rate schedule was used, with an initial learning rate of 2×10⁻⁵, a weight decay of 0.1, a batch size of 64, and a sequence length uniformly set to 4096 tokens.1
  • Reinforcement Learning from Human Feedback (RLHF):
    RLHF is a key step to further align model behavior to better match human preferences and expectations.
    • Data Collection: Llama 2 employed a binary comparison protocol to collect human preference data. Annotators would first write a prompt, and then the model would generate two different responses to this prompt. Annotators then needed to judge which response was better based on a set of predefined criteria (e.g., helpfulness, honesty, harmlessness) and indicate the degree of preference (e.g., "significantly better," "slightly better"). Through this method, the research team collected over 1 million human binary comparisons of model outputs.1
    • Reward Modeling (RM): Based on the collected human preference data, Llama 2 trained two separate reward models: one focused on evaluating the "Helpfulness" of responses, and the other on evaluating "Safety." The use of two separate RMs was to better handle potential conflicts and trade-offs between these two objectives. These RMs themselves were initialized from pretrained chat model checkpoints to leverage the model's existing language understanding capabilities. When training RMs, a binary ranking loss function was used, and a margin component was introduced, so that samples with more significant preference differences would achieve a larger score difference from the RM, which helped improve RM accuracy.1
    • Iterative Fine-tuning: In the RLHF fine-tuning phase, Llama 2 explored and combined two main reinforcement learning algorithms: Proximal Policy Optimization (PPO) and Rejection Sampling fine-tuning. Typically, rejection sampling was performed first, where multiple candidate responses were generated from the current model, the RM selected the optimal response, and these optimal responses were then used as new SFT data to fine-tune the model. Afterwards, the PPO algorithm was applied on top of the rejection sampling fine-tuned model, using the RM's output as the reward signal to further optimize the model's policy to maximize cumulative reward. During the PPO phase, signals from both the helpfulness RM and the safety RM were considered to guide the model's optimization direction.1
  • Multi-turn Consistency (Ghost Attention - GAtt):
    To address the issue of early RLHF models easily forgetting or deviating from initial instructions in multi-turn dialogues, Llama 2 proposed a simple yet effective method called Ghost Attention (GAtt). The core idea of GAtt is to attach the initial system-level instruction (e.g., "You are now playing the role of a helpful AI assistant") to every user message in the multi-turn dialogue data, and then fine-tune the model on this processed synthetic data (see the data-construction sketch after this list). In this way, the model is continuously "reminded" of the initial instruction, thereby enhancing its ability to remember and follow these instructions in multi-turn dialogues.1
  • Safety Alignment:
    Ensuring model safety was a core consideration during the development of Llama 2.
    • Pretraining Data Considerations: Interestingly, Llama 2 did not actively or aggressively filter out potentially negative content such as hate speech from the data during the pretraining phase. The research team believed that retaining this data might actually allow the model to have better generalization capabilities in the subsequent safety tuning phase, i.e., achieving good safety alignment with less safety annotation data. Of course, this also means that Llama 2's base model itself requires extensive safety tuning to be deployed safely.1
    • Safety Fine-tuning Techniques: Llama 2 employed a multi-layered set of safety fine-tuning techniques, including:
      1. Supervised Safety Fine-Tuning: Collecting data containing adversarial prompts (i.e., prompts attempting to induce the model to produce unsafe outputs) and corresponding safe response demonstrations, and adding this data to the SFT phase, allowing the model to learn how to respond safely early on.
      2. Safety RLHF: Integrating safety considerations into the RLHF process, such as training the aforementioned safety-specific reward model, and collecting more challenging adversarial prompts for rejection sampling and PPO optimization.
      3. Safety Context Distillation: This is a technique to guide the model to generate safer responses. Specifically, a safety prefix (e.g., "You are a safe and responsible assistant") is artificially added before prompts that might elicit unsafe answers. The model then generates responses under this guidance. Since the safety prefix guides the model, the generated responses are usually safer. Then, these "prompt-safe answer" pairs (without the safety prefix) are used to fine-tune the model, thereby "distilling" this safe behavior into the model itself, enabling it to provide safe answers even without the safety prefix.1
    • Red Teaming: To proactively identify and mitigate potential risks of the model, Llama 2 underwent extensive red teaming. Diverse teams, including internal employees, contract workers, and external domain experts, were invited to probe the model's safety boundaries from various predefined risk categories (e.g., illegal activities, hate speech, unqualified advice) and attack vectors (e.g., psychological manipulation, logical loopholes, syntactic deception). Issues and data found during red teaming were fed back into the model's iterative development for guiding subsequent safety training and improvements.1
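
As referenced in the SFT bullet above, the loss is computed only on answer tokens. The sketch below shows one way to implement that masking for a single (instruction + answer) sequence; the tensor shapes and the use of the conventional -100 ignore index are simplifying assumptions, not Llama 2's actual training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token prediction loss over one (prompt + answer) sequence, with
    prompt positions masked out so only answer tokens contribute gradients.
    logits: [seq_len, vocab_size]; input_ids: [seq_len]."""
    shift_logits = logits[:-1]                 # position t predicts token t+1
    shift_labels = input_ids[1:].clone()
    shift_labels[: prompt_len - 1] = -100      # ignore predictions of prompt tokens
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy example: a 6-token sequence whose first 3 tokens are the user instruction.
vocab_size, seq_len, prompt_len = 32, 6, 3
loss = sft_loss(torch.randn(seq_len, vocab_size),
                torch.randint(0, vocab_size, (seq_len,)), prompt_len)
print(loss)
```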
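
And as referenced in the GAtt description above, here is a sketch of the data-side construction: attaching the initial system-level instruction to every user turn of a multi-turn dialogue before fine-tuning. The dialogue format is a made-up illustration, and the sketch omits the loss-masking details of the full GAtt procedure.

```python
def apply_gatt(system_instruction: str, dialogue: list[dict]) -> list[dict]:
    """Return a copy of a multi-turn dialogue in which the system-level
    instruction is attached to every user message, so the model is
    repeatedly 'reminded' of it during fine-tuning."""
    augmented = []
    for turn in dialogue:
        if turn["role"] == "user":
            augmented.append({"role": "user",
                              "content": f"{system_instruction}\n{turn['content']}"})
        else:
            augmented.append(dict(turn))
    return augmented

dialogue = [
    {"role": "user", "content": "Hi, who won the 2018 World Cup?"},
    {"role": "assistant", "content": "France won the 2018 World Cup."},
    {"role": "user", "content": "And who was the team captain?"},
]
print(apply_gatt("Always answer as a concise football expert.", dialogue))
```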

Llama 2's research design reflects a comprehensive consideration for building high-performance, responsible open-source LLMs. It not only open-sourced the model weights but, more importantly, disclosed with unprecedented transparency the complete methodology and practical experience from pretraining, SFT, RLHF, to safety alignment.1 This approach contrasts sharply with many powerful closed-source models, whose performance is superior but whose training methods and alignment strategies are often unknown to the outside world, like a "black box".1 Llama 2's openness and detailed documentation provide a valuable template and practical path for academia and industry, greatly lowering the barrier for researchers to reproduce, improve, and build similar high-performance LLMs. Specifically, its extreme pursuit of data quality in the SFT stage 1, the innovative dual reward model design in the RLHF stage to balance usefulness and safety 1, the GAtt mechanism proposed for multi-turn dialogue consistency 1, and the multi-stage, multi-technology integrated safety alignment strategy (including context distillation and extensive red teaming 1) are all very specific and highly valuable practical experiences. This open stance will undoubtedly accelerate the democratization of AI technology, encourage the open-sourcing of more high-performance models, and promote the entire field towards a more transparent and responsible direction. At the same time, Llama 2's high emphasis on and systematic investment in safety also set an important benchmark for the industry, potentially triggering more in-depth discussions on the potential risks and social benefits of open-source models, and promoting the establishment of relevant governance frameworks and best practices.

2.2 FLAN: Finetuned Language Models Are Zero-Shot Learners (arXiv: 2109.01652)

The core idea of FLAN 2 is that through a method called "Instruction Tuning," the zero-shot learning ability of pretrained language models on tasks they have never seen before can be significantly improved. This means the model can perform a new task based solely on its natural language description (i.e., instruction) without requiring additional sample learning for that task.

  • Model Basis: The study used a pretrained language model with 137 billion parameters as its foundation. This model is LaMDA-PT, the pretrained-only variant of LaMDA (i.e., the base model before any dialogue fine-tuning).2
  • Construction of Instruction Datasets: This was a core part of FLAN's research design.
    • Researchers constructed a mixture of over 60 different NLP datasets. These datasets were organized into multiple task clusters, covering various common NLP task types such as natural language inference (NLI), reading comprehension, translation, commonsense reasoning, and sentiment analysis.2
    • Crucially, each dataset was "instructionalized" or "verbalized" through a series of natural language instruction templates. For example, for a sentiment classification task, the instruction could be "Please determine if the sentiment of the following sentence is positive, negative, or neutral: [sentence]". To increase instruction diversity and improve the model's generalization ability, researchers designed multiple (e.g., 10) unique instruction templates for each dataset (an illustrative sketch of such templates follows this list).2
    • When evaluating the model's performance on a specific task type (e.g., natural language inference), researchers ensured that the instruction tuning training set did not include any tasks from that task cluster. This was done to strictly test whether the model truly learned to generalize to "unseen" task types, rather than just memorizing the tasks it was trained on.2
  • Training Setup: After constructing the instruction dataset, researchers used this mixed instruction dataset to fine-tune the 137B parameter base pretrained language model. This process is instruction tuning.
  • Evaluation Method: The performance of the instruction-tuned model (called FLAN) was primarily evaluated by its zero-shot performance on task types not included in the training set. Its results were compared with the zero-shot or few-shot performance of the original, non-instruction-tuned base model, as well as other SOTA models at the time (like GPT-3).3
  • Key Design Considerations (Ablation Studies Focus): To deeply understand the elements of instruction tuning success, researchers conducted a series of ablation studies, focusing on the impact of the following factors on the effectiveness of instruction tuning:
    1. Number of fine-tuning tasks: Does using more instructionalized tasks during training lead to stronger generalization ability on unseen tasks?
    2. Model scale: Does the effectiveness of instruction tuning depend on the size of the base model?
    3. Natural language formulation of instructions: Do the wording, diversity, etc., of instructions affect the results? For example, what is the effect of only providing input-output examples without explicit instructions? 3
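
As referenced above, each dataset was verbalized through multiple instruction templates. The sketch below shows what this looks like for one sentiment-classification example; the templates themselves are invented for illustration rather than copied from FLAN's template set.

```python
# Hypothetical instruction templates for a sentiment-classification dataset;
# FLAN wrote several distinct templates per dataset to increase instruction diversity.
TEMPLATES = [
    "Please determine if the sentiment of the following sentence is positive or negative: {text}",
    "Is the following movie review positive or negative?\n{text}",
    "{text}\nWhat is the sentiment of the sentence above?",
]

def verbalize(example: dict) -> list[dict]:
    """Turn one (text, label) pair into several instruction-formatted
    training examples, one per template."""
    return [{"input": template.format(text=example["text"]), "target": example["label"]}
            for template in TEMPLATES]

print(verbalize({"text": "The movie was a delight from start to finish.",
                 "label": "positive"}))
```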

FLAN's success 3 profoundly reveals a very effective and intuitive form of "meta-learning" in large language models. Its core is not to make the model memorize specific task knowledge, but rather to teach the model a more general meta-skill of "how to understand and follow instructions" by exposing it to a large number of different tasks presented with natural language instructions.2 Once the model masters this meta-skill, it can better apply its existing pretrained knowledge to new tasks described only by instructions, thereby unlocking its zero-shot potential on unknown tasks. This method of improving generalization ability through instruction diversity, rather than simply increasing model scale or pretraining data volume, can more directly enhance the model's universality and ease of use in practical applications. As FLAN's experiments showed, the more diverse and numerous the tasks involved in instruction tuning, the better the model performed on unseen, entirely new tasks ("more tasks in training = better performance on held-out and new tasks").2 This is similar to the human learning process: we cultivate general problem-solving skills and the ability to learn new knowledge by learning to solve various types of problems. From this perspective, instruction tuning is teaching the model "how to learn." This finding has significant implications for the development of the LLM field: model performance improvement depends not only on "big data" and "big models" but, more importantly, on "good learning methods" and "high-quality guidance information." Future research may focus more on designing more effective and generalizable instruction formats, constructing instruction datasets that cover a broader range of capabilities and are more challenging, and extending this ability to learn "meta-skills" through instructions to more complex reasoning tasks and multimodal tasks. Furthermore, FLAN's approach also provides a viable path for smaller language models (SLMs): through efficient instruction tuning in specific domains, they can match or even surpass the performance that larger general-purpose LLMs achieve with prompt engineering, as observed in some subsequent domain-specific studies (e.g., low-code workflow generation).21

2.3 InstructGPT: Training language models to follow instructions with human feedback (arXiv: 2203.02155)

The core goal of InstructGPT 5 was to address the problem that large language models (like GPT-3), while powerful, often behave in ways that do not align with user expectations. Specifically, researchers aimed to use specific training methods to make language models better at following user instructions and generating outputs that are more helpful, honest, and harmless, i.e., to improve the model's "alignment" level.

  • Model Basis: This research work conducted subsequent fine-tuning and alignment based on OpenAI's GPT-3 series of pretrained language models.
  • Three-step Alignment Process: The alignment process of InstructGPT is its core research design, comprising the following three key steps 5:
    1. Step 1: Collect demonstration data, and train a supervised policy / SFT.
      • Data Collection: Human labelers were invited to write high-quality, desired outputs for a series of prompts. Some of these prompts came from real user inputs on the OpenAI API, while others were written by labelers themselves based on preset scenarios and requirements to ensure data diversity.
      • Model Training: These human-written "prompt-demonstration output" pairs were used to supervised fine-tune (SFT) the GPT-3 pretrained model. The goal of this stage was to let the model initially learn how to generate responses that meet basic requirements according to instructions. Approximately 13,000 training prompts were used in this stage.
    2. Step 2: Collect comparison data, and train a reward model / RM.
      • Data Collection: Given a prompt, the SFT model trained in the first stage was used to generate multiple different outputs (e.g., by adjusting sampling temperature).
      • Human Preference Labeling: Human labelers compared and ranked these multiple outputs generated by the SFT model, indicating which output was better, or what preference relationship existed between them.
      • Reward Model Training: This human preference comparison data was used to train an independent reward model (RM). The input to the RM was a "prompt-model output" pair, and the output was a scalar score representing the degree of human preference for this model output. The RM's goal was to learn to simulate human judgment criteria. Approximately 33,000 training prompts were used in this stage to train the RM.
    3. Step 3: Optimize a policy against the reward model using PPO / RLHF.
      • Reinforcement Learning Environment: The RM trained in the second stage was used as the reward function in the reinforcement learning environment.
      • Policy Optimization: The Proximal Policy Optimization (PPO) algorithm was used to further fine-tune the SFT model obtained in the first stage (now acting as the policy network in the PPO algorithm). The goal of the PPO algorithm was to adjust the parameters of the policy network so that its generated outputs could receive higher reward scores from the RM, thereby making its behavior more aligned with human preferences (a minimal sketch of this reward shaping follows this list). Approximately 31,000 prompts from the API were used for PPO training in this stage. Steps 2 and 3 could be iterated, i.e., collecting new comparison data based on the current optimal policy to train new RMs and policies.
  • Data Sources and Composition: InstructGPT's training data primarily came from real user text prompts collected from early InstructGPT models on the OpenAI API Playground interface, as well as prompts specially written by labelers to guide the model to learn specific capabilities (such as following complex instructions, generating text in specific formats, etc.). These prompts covered a very wide range of task types, such as open-ended text generation, question answering, brainstorming, dialogue, text rewriting, summarization, classification, and information extraction.5
  • Human Data Collection: OpenAI employed approximately 40 contract workers (through platforms like Upwork and ScaleAI) to perform demonstration data writing and comparison data labeling tasks, and to participate in the final model evaluation. These labelers all underwent screening tests to ensure they could identify and appropriately respond to sensitive content. During the labeling process, researchers provided labelers with detailed labeling guidelines and continuous support, and emphasized that model outputs should prioritize usefulness to the user, while also striving for truthfulness and harmlessness.5
  • Evaluation Method: InstructGPT's performance was primarily measured through human preference evaluation, i.e., having labelers directly compare the outputs generated by InstructGPT and baseline models (such as original GPT-3, SFT-only models, etc.) for the same prompt, and judge which was superior. In addition to subjective preference evaluation, researchers also examined the model's objective metrics on some public NLP datasets, such as evaluating truthfulness on the TruthfulQA dataset and output toxicity on datasets like RealToxicityPrompts.5
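
As referenced in the PPO step above, the reward the policy is optimized against is derived from the reward model's score; in the InstructGPT objective this score is additionally penalized by the KL divergence between the updated policy and the SFT model, which keeps the policy from drifting too far from its starting point. The sketch below shows only this reward shaping in isolation, with illustrative tensors and an illustrative coefficient standing in for real model outputs.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprob_policy: torch.Tensor,
                  logprob_sft: torch.Tensor,
                  beta: float = 0.02) -> torch.Tensor:
    """Per-response reward used during RLHF policy optimization: the reward
    model's scalar score minus a penalty proportional to an estimate of the
    KL divergence between the current policy and the SFT model (here the
    summed per-token log-probability differences). beta is illustrative."""
    kl_estimate = (logprob_policy - logprob_sft).sum(dim=-1)
    return rm_score - beta * kl_estimate

# Toy batch of two responses, each four tokens long.
rm = torch.tensor([0.8, -0.1])
lp_policy = torch.tensor([[-1.0, -0.5, -0.7, -0.9], [-2.0, -1.5, -1.0, -0.8]])
lp_sft    = torch.tensor([[-1.1, -0.6, -0.6, -1.0], [-1.8, -1.4, -1.2, -0.9]])
print(shaped_reward(rm, lp_policy, lp_sft))
```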

The three-stage alignment process proposed by InstructGPT, particularly the introduction of Reinforcement Learning from Human Feedback (RLHF), marked an important turning point in the development of large language models: a shift from merely pursuing the model's ability to "generate text" to pursuing its ability to "generate high-quality text that humans expect".5 The core of this transformation lies in RLHF's success in converting originally difficult-to-quantify human preferences (such as whether an answer is "useful," "reasonable," or "interesting") into a concrete reward signal that the model can understand and optimize. There is often a significant discrepancy between the traditional training objectives of language models (such as maximizing the probability of predicting the next word) and users' expectations of the model in practical applications (such as hoping the model provides helpful, truthful, harmless, and instruction-compliant responses). This is known as the "misalignment" problem.5 The SFT stage provides the model with initial instruction-following capabilities by directly imitating human demonstrations. However, human expectations are often complex and subtle, and difficult to fully capture through limited demonstration samples. RLHF, by directly optimizing the human preference signal proxied by the reward model, performs a more refined and comprehensive adjustment of the model's behavior, enabling it to better adapt to these complex expectations. In this process, the reward model plays a crucial role; it acts like a "translator," "encoding" subjective, multi-dimensional human judgments into a scalar reward value that the model can use to guide its behavior optimization during reinforcement learning. The PPO algorithm then provides a method that can effectively explore new, potentially better behavioral policies under this reward signal, while also avoiding excessive deviation from the initial policy (SFT model) that could lead to training instability. InstructGPT's success greatly promoted the application of RLHF in the alignment of large language models, making it one of the core techniques for aligning many subsequent advanced models (including Llama 2-Chat, ChatGPT, etc.). At the same time, InstructGPT's work also triggered ethical reflections in academia and industry regarding "alignment" itself: to whose preferences are we aligning the model? 5 The background and values of the labelers, as well as the design of the labeling guidelines, profoundly affect the final model's behavior. This poses crucial and urgent questions for the future development direction of alignment technology, and how to design and implement fairer, more transparent, and more representative alignment processes.

2.4 MiniGPT-4: Deep Fusion of Vision and Language (MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models) (arXiv: 2304.10592)

The core research idea of MiniGPT-4 6 was to explore a lightweight method by aligning a pretrained and parameter-fixed visual encoder with an advanced, similarly parameter-fixed LLM (specifically Vicuna). The goal was to see if it could reproduce, at a lower computational cost, various advanced multimodal capabilities demonstrated by models like GPT-4, such as generating websites from hand-drawn sketches, identifying humorous elements in images, or producing imaginative descriptions of image content.

  • Model Architecture: MiniGPT-4's architecture reflects the principles of "modular reuse" and "minimizing trainable parameters."
    • Visual Encoder: The model employed a pretrained Vision Transformer (ViT), specifically ViT-G/14 from EVA-CLIP, combined with the Q-Former from the BLIP-2 model. Importantly, both ViT and Q-Former remained parameter-frozen during MiniGPT-4's training, not participating in gradient updates. The role of Q-Former here was to convert visual features extracted from ViT into a fixed number of query vectors that the LLM could understand.6
    • Large Language Model (LLM): MiniGPT-4 selected Vicuna as the core for its language understanding and generation. Vicuna itself is based on the LLaMA model and obtained through instruction fine-tuning, possessing strong dialogue and instruction-following capabilities. Similar to the visual encoder, Vicuna's parameters also remained frozen during MiniGPT-4's training.6
    • Alignment Layer: Connecting the frozen visual module and the frozen LLM module was a very simple, and the only trainable, part of the model: a linear projection layer. This projection layer was responsible for projecting the visual feature vectors output by the Q-Former into a space with the same dimensionality as the LLM's word embedding space. These projected visual features then served as a "soft prompt" to the LLM, guiding it to generate text based on the image content (a minimal sketch of this projection follows this list).6
  • Two-stage Training Approach: To achieve effective vision-language alignment and enhance generation quality, MiniGPT-4 adopted a two-stage training method 6:
    1. First Stage: Pretraining for Vision-Language Knowledge.
      • Data: In this stage, the model was trained on a relatively large-scale dataset of image-text pairs; roughly 5 million samples drawn from public caption datasets (including Conceptual Captions) were used.22 These image-text pairs usually consist of an image and its corresponding short description or title.
      • Objective: The main goal of this stage was to enable the trainable projection layer to learn how to effectively map visual features to a representation space that the LLM could understand, thereby allowing the LLM to establish a basic understanding of image content.
      • Problem Encountered: Researchers found that if only these image-text pairs with short image captions were used for first-stage training, the model, although able to understand images, often produced unnatural language, with issues like repetition and fragmented sentences.6
    2. Second Stage: Fine-tuning for Generation Reliability and Usability.
      • Data: To address the problem of unnatural language generation in the first stage, researchers carefully curated and filtered a smaller (about 3,500 pairs) but very high-quality dataset of image-text pairs with highly detailed descriptions.6 These texts were no longer simple captions but richer, more detailed descriptions of the image content.
      • Objective: The model pretrained in the first stage (mainly the projection layer) was further fine-tuned on this high-quality small dataset. The goal of this stage was to significantly improve the natural fluency and overall usability of the language generated by the model, enabling it to produce more detailed, coherent descriptions and dialogues that better matched human expression habits. A specific chat template was also used during fine-tuning to organize inputs and outputs, better adapting to conversational interaction.9
  • Data Collection and Processing: As mentioned above, the first stage primarily relied on publicly available large-scale image-text pair datasets. The second stage, however, depended on a high-quality, small-scale dataset of detailed image descriptions collected and curated by the research team, which was crucial for improving the model's final output quality.6
  • Experiments and Evaluation: The MiniGPT-4 paper primarily showcased its emergent advanced multimodal capabilities through numerous qualitative examples. These examples included: generating highly detailed and imaginative image descriptions, directly generating runnable website code from user-drawn sketches, creating stories or poems inspired by given images, providing detailed cooking steps and recipes after observing food pictures, and identifying problems shown in pictures and offering solutions.6
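
As referenced in the alignment-layer description above, the only trainable component is a linear projection that maps the frozen Q-Former's query outputs into the LLM's embedding space, where they are prepended to the text embeddings as a soft prompt. The sketch below shows this bridging step; the dimensions, module name, and use of simple concatenation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualSoftPrompt(nn.Module):
    """Trainable linear bridge between a frozen visual side (Q-Former query
    outputs) and a frozen LLM (its token-embedding space)."""
    def __init__(self, qformer_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, query_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # query_feats: [batch, num_queries, qformer_dim] from the frozen Q-Former.
        # text_embeds: [batch, text_len, llm_dim] from the frozen LLM's embedding table.
        visual_prompt = self.proj(query_feats)                  # [batch, num_queries, llm_dim]
        return torch.cat([visual_prompt, text_embeds], dim=1)   # sequence fed to the frozen LLM

bridge = VisualSoftPrompt()
out = bridge(torch.randn(2, 32, 768), torch.randn(2, 16, 4096))
print(out.shape)  # torch.Size([2, 48, 4096])
```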

MiniGPT-4's research design embodies a philosophy of "lightweight alignment," its core being the efficient "grafting" of the capabilities of two powerful pretrained unimodal models (a visual encoder and an LLM) by minimizing trainable parameters (just a linear projection layer) and employing a clever staged optimization strategy.6 The significant advantage of this method is its computational efficiency. Since most parameters (visual encoder and LLM) remain frozen, the training cost is relatively low (reportedly only about 40 A100 GPU hours 9), enabling more researchers and institutions to conduct such cutting-edge multimodal language model research, greatly lowering the technical barrier. This design strategy allows the model to quickly leverage the strong prior knowledge and capabilities of existing SOTA unimodal models without needing to learn these fundamental abilities from scratch. The projection layer here acts as a "translator" or "adapter," tasked with translating the "language" of the visual modality (i.e., visual features) into a "language" that the LLM can understand and process (i.e., the LLM's embedding representations). Despite its very simple structure, MiniGPT-4 exhibited various impressive, even somewhat unexpected, emergent capabilities.6 This strongly suggests that the powerful general reasoning and generation capabilities inherent in advanced LLMs (like Vicuna) can be effectively stimulated and transferred to new modalities (like vision) through appropriate alignment mechanisms. This "modular components + lightweight alignment" approach provides a highly attractive and efficient paradigm for building multifunctional, multimodal AI systems. Future research may build upon this to explore more complex yet still efficient projection layer designs (e.g., introducing a small amount of non-linearity or attention mechanisms), more optimized cross-modal alignment strategies, and how to extend this successful "grafting" method to more modality combinations (e.g., some ideas in Video-LLaMA in the audio-visual domain also reflect similar modular thinking 16). At the same time, MiniGPT-4's two-stage training method once again highlights the extreme importance of high-quality, task-specific fine-tuning data (such as the detailed image description data used in its second stage 9) for improving the final output quality, naturalness, and practicality of the model.

2.5 Video-LLaMA: Unified Audio-Visual Understanding Model (Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding) (arXiv: 2306.02858)

The core goal of Video-LLaMA 16 is to build a system that allows Large Language Models (LLMs) to simultaneously understand the visual dynamic information and synchronous auditory content contained in videos, and to engage in dialogue with users based on this comprehensive understanding. This is more challenging than tasks involving only static images or a single audio modality.

  • Model Architecture 16: Video-LLaMA's overall architecture is built around a frozen LLM (e.g., Vicuna) and processes visual and auditory inputs through two main branches, ultimately fusing this multimodal information before feeding it into the LLM.
    • LLM Backbone: Serves as the core "brain" of the model, responsible for language understanding, reasoning, and generation. Its parameters remain frozen during training to preserve its powerful language capabilities.
    • Vision-Language Branch: This branch is responsible for processing the visual content in videos.
      • Visual Encoder: First, a pretrained image encoder (e.g., ViT-G/14 from EVA-CLIP) is used to extract static image features from each frame of the video. Then, borrowing from BLIP-2's design, a pretrained Q-Former is introduced to perform initial aggregation and dimensionality reduction on these frame-level features.
      • Video Q-Former: This module is specifically designed for processing video sequences. It receives frame-level representations from the image encoder and image Q-Former and encodes the temporal order of video frames by introducing positional embeddings. The role of the Video Q-Former is to further integrate these time-stamped frame-level features to generate compact representations of the entire video clip or its key visual content (a simplified structural sketch follows this list).
      • Linear Projection Layer: Finally, a linear projection layer maps the video representations output by the Video Q-Former to the same dimensional space as the LLM's word embeddings, serving as visual soft prompts for the LLM.
    • Audio-Language Branch: This branch is responsible for processing the audio content in videos.
      • Audio Encoder: Video-LLaMA innovatively uses the ImageBind model as its audio encoder. ImageBind is a powerful multimodal encoder, whose salient feature is its ability to map input information from different modalities (such as images, text, audio, depth maps, etc.) into a unified, shared embedding space. This means that in ImageBind's embedding space, semantically similar images and audio are represented as close vectors.
      • Audio Q-former: Similar to the video branch, the audio branch also uses a Q-Former structure (sharing a similar architecture with the Video Q-Former) to fuse features extracted from different audio segments by ImageBind, generating fixed-length audio representations.
      • Linear Projection Layer: Similarly, a linear projection layer maps the audio representations output by the Audio Q-Former to the LLM's embedding space.
  • Training Strategy (Multi-branch Cross-Modal Training) 16: Video-LLaMA's vision-language branch and audio-language branch are trained separately, and both adopt a two-stage training process.
    • Vision-Language Branch Training:
      • First Stage (Pre-training): Pretrained on large-scale video-caption pair datasets (e.g., Webvid-2M, containing about 2 million video clips and their text descriptions) and image-caption pair datasets (e.g., CC595k). The training task is video-to-text generation, i.e., given the visual representation of a video, the frozen LLM generates the corresponding text description. The goal of this stage is to enable the model (mainly the Video Q-Former and projection layer) to learn the basic correspondence between video content and natural language.
      • Second Stage (Fine-tuning): Although the pretrained model has some video understanding ability, its ability to follow complex instructions and conduct fluent dialogue may have declined. Therefore, it needs to be fine-tuned on a dataset containing higher-quality instruction data. This instruction data can come from various visual question answering or visual dialogue datasets, such as the image detailed description data generated by MiniGPT-4, LLaVA's image instruction data, and video instruction data provided by the Video-Chat project. This stage aims to improve the model's performance in instruction following, detail understanding, and dialogue interaction.
    • Audio-Language Branch Training:
      • Challenges Faced: Compared to visual-text pair data, high-quality, large-scale audio-text pair data (especially audio descriptions synchronized with video content) is relatively scarce, which poses difficulties for directly training the audio branch.
      • Innovative Method: Video-LLaMA cleverly utilizes ImageBind's multimodal alignment features to solve this problem. Specifically, the training of the audio-language branch does not directly use audio-text data but reuses the same visual-text data as the vision-language branch for training. The underlying logic is: since ImageBind has already embedded semantically similar images and audio into close feature spaces, when training the audio Q-Former and projection layer to fit the text output expected by the LLM for visual input (produced by the vision branch), the audio branch is actually indirectly learning how to "translate" audio features (encoded by ImageBind) into representations that the LLM can understand. In other words, by learning "what to say when seeing this picture," the model is also learning "what to say when hearing this sound (semantically related to this picture)." Therefore, despite not being explicitly trained on audio-text pairs, Video-LLaMA can still exhibit surprising zero-shot audio understanding capabilities during inference.16
  • Data Collection and Processing: As mentioned above, Video-LLaMA's training relies on large-scale public video/image caption datasets for pretraining, and a series of high-quality visual instruction datasets for fine-tuning. The training of its audio branch cleverly reuses the training data of the visual branch through the features of ImageBind, avoiding strong dependence on large-scale audio-text annotation data.16
  • Experiments and Evaluation: The Video-LLaMA paper primarily demonstrates its dialogue and understanding capabilities in various multimodal scenarios through a series of qualitative case studies. These cases cover audio-visual joint question answering (e.g., asking what a person in a video said while something happened on screen), temporal dynamic capture (e.g., describing the motion trajectory of objects or the sequence of events in a video), static image understanding (as a special case of video), and common-sense concept recognition (e.g., recognizing celebrities or landmarks in a video and answering related questions).16
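
As referenced in the Video Q-Former description above, the vision-language branch marks each frame with its position in time before aggregating the frames into video-level soft prompts. The sketch below caricatures that branch with a plain transformer encoder standing in for the Video Q-Former; the layer sizes, the encoder choice, and the absence of learnable query tokens are simplifying assumptions rather than Video-LLaMA's actual architecture.

```python
import torch
import torch.nn as nn

class VideoAggregator(nn.Module):
    """Toy stand-in for the Video Q-Former: adds learnable temporal position
    embeddings to per-frame features, mixes them with a transformer, and
    projects the result into the LLM's embedding space."""
    def __init__(self, dim: int = 768, max_frames: int = 32, llm_dim: int = 4096):
        super().__init__()
        self.temporal_pos = nn.Embedding(max_frames, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=2)
        self.to_llm = nn.Linear(dim, llm_dim)    # projection into the LLM embedding space

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: [batch, num_frames, dim], e.g. pooled per-frame visual features.
        positions = torch.arange(frame_feats.size(1), device=frame_feats.device)
        x = frame_feats + self.temporal_pos(positions)   # mark each frame with its time step
        return self.to_llm(self.mixer(x))                # visual soft prompts for the frozen LLM

print(VideoAggregator()(torch.randn(2, 8, 768)).shape)   # torch.Size([2, 8, 4096])
```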

Video-LLaMA's core innovation lies in its clever use of ImageBind, a "universal multimodal connector".16 Faced with the common problem of scarce audio-text paired data, Video-LLaMA did not attempt to collect and annotate massive amounts of such data. Instead, it took an unconventional approach, achieving audio-text alignment by "borrowing" from the more mature and abundant visual-text training data.16 The success of this strategy relies on ImageBind's ability to map semantic information from different modalities (especially images and audio) into the same or at least highly similar feature spaces.16 When the audio branch's Q-Former and projection layer are trained to output representations that match the text expected by the LLM when processing corresponding visual input, the audio branch is indirectly learning how to "translate" or "align" audio features to semantic representations understandable by the LLM, because ImageBind has already ensured the proximity of related audio and visual concepts in the embedding space. This strategy of "transfer learning" or "cross-modal knowledge distillation" provides a highly efficient and practically feasible approach for multimodal learning when data for specific modalities is scarce. This has important implications for building broader and more comprehensive multimodal AI systems: if a sufficiently powerful "universal embedding space" (like that pursued by ImageBind and its successors) covering multiple modalities exists, then even if paired data between certain specific modalities is insufficient, we might be able to indirectly achieve alignment of these data-scarce modalities with LLMs by utilizing other, more abundant modal paired data. This undoubtedly highlights the extreme importance of developing more powerful and universal multimodal encoders (as the "common language" foundation for different modal information entering LLMs) in future multimodal research.

2.6 Synchronous LLMs: Implementing Full-Duplex Dialogue Agents (Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents) (arXiv: 2409.15594)

The core objective of the research "Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents" 10 is to address a fundamental flaw in the interaction naturalness of most current voice dialogue models: they are mostly half-duplex, meaning the user and AI assistant need to take turns speaking. The AI typically needs to wait for an explicit end cue from the user (like a pause) or detect a silence event before it can begin responding. This mode is far removed from the fluid, dynamic way humans converse. Human dialogue is "full-duplex," allowing both parties to speak and listen simultaneously, thereby enabling rapid turn-taking, natural speech overlap (e.g., one party starting to respond or interject feedback before the other has finished speaking), and real-time backchanneling (short responses like "uh-huh," "yes," indicating listening and understanding).12

  • Key Challenges: The main technical obstacle to achieving such full-duplex dialogue is that traditional pretrained Large Language Models (LLMs) themselves lack a concept of "time." LLMs typically process discrete sequences of text symbols and lack an inherent perception of the passage of real-world time, the start and end of speech signals, and the need for precise synchronization between dialogue participants.11 Therefore, to enable LLMs to participate in full-duplex dialogues requiring precise temporal coordination, it is necessary to somehow introduce a time dimension to them.
  • Model Architecture and Method (Synchronous LLMs):
    • Time Integration: To overcome LLMs' lack of inherent time awareness, the researchers designed a new mechanism aimed at effectively integrating time information into the internal workings of LLMs (specifically, the Llama3-8b model was used in experiments). Through this mechanism, the LLM's response generation can be synchronized with a real-world physical clock, laying the foundation for achieving full-duplex interaction (a minimal sketch of one way to clock-align token streams is given after this list).11
    • Full-Duplex Architecture: The model is designed to simultaneously process a continuous input stream from the user (usually a speech stream) and generate an output stream for the AI agent (also a speech stream). A related paper 10 mentions a novel duplex speech-to-speech (S2S) architecture with continuous user input and codec-processed agent output, directly modeling simultaneous user and agent speech streams through channel fusion.
    • Handling Barge-in: An important feature of full-duplex dialogue is allowing the user to interrupt at any time while the AI is speaking. Therefore, the model design of Synchronous LLMs also aims to support this user barge-in behavior and make real-time adaptive adjustments.10
  • Training Data:
    • To train Synchronous LLMs, researchers adopted a mixed data strategy. They first utilized a large amount of synthetic speech dialogue data generated from text dialogue data, totaling approximately 212,000 hours.11 This synthetic data might be used to teach the model basic dialogue flow, language patterns, and preliminary concepts of temporal coordination.
    • Subsequently, they used a relatively small amount (about 2,000 hours) of real-world recorded speech dialogue data to further fine-tune the model.11 This real data is crucial for the model to learn the more subtle and natural interaction dynamics of human dialogue (such as real pauses, changes in speech rate, emotional expression, and complex interruption and overlap patterns).
    • Another related work 10 mentioned that using a pretrained streaming speech encoder to process user input can eliminate the need for specialized pretraining of the entire speech front-end when building a duplex S2S model, thereby simplifying the model construction process.
  • Experiments and Evaluation:
    • Researchers primarily evaluated the performance of their Synchronous LLMs from two aspects: dialogue meaningfulness (i.e., whether the model's generated responses are relevant, informative, and logical) and dialogue naturalness (i.e., whether the interaction process is smooth and sounds like human-to-human dialogue).11
    • To more intuitively demonstrate the model's full-duplex dialogue capabilities, they also conducted a simulation experiment: letting two Synchronous LLM agents trained on different datasets converse with each other. In this simulation, researchers also considered and introduced network latencies of up to 240 milliseconds to test the model's interaction robustness in environments close to real internet conditions.11
    • The duplex S2S architecture proposed in 10, by directly modeling synchronous streams through continuous user input and channel fusion, claims superiority over previous duplex models in terms of inference, turn-taking, and barge-in handling capabilities.
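As a concrete illustration of what "integrating time into the token stream" can look like, the following is a minimal sketch of one possible clock-alignment scheme, not the paper's actual implementation. It assumes a fixed rate of 25 speech tokens per second (one token per 40 ms frame) and an explicit `<silence>` token for empty frames; the user and agent channels are then interleaved so that every decoding step corresponds to the same slice of wall-clock time. Overlap and barge-in simply appear as frames where both channels carry non-silence tokens.

```python
from dataclasses import dataclass

FRAME_RATE_HZ = 25           # assumption: one speech token per 40 ms frame
SILENCE_TOKEN = "<silence>"  # assumption: explicit token for frames with no speech

@dataclass
class Utterance:
    start_s: float   # wall-clock start time of the utterance
    tokens: list     # speech tokens, one per frame

def clock_aligned_stream(utterances, total_s):
    """Lay speech tokens onto a fixed-rate timeline so every position = 40 ms."""
    n_frames = int(total_s * FRAME_RATE_HZ)
    stream = [SILENCE_TOKEN] * n_frames
    for utt in utterances:
        start = int(utt.start_s * FRAME_RATE_HZ)
        for i, tok in enumerate(utt.tokens):
            if start + i < n_frames:
                stream[start + i] = tok
    return stream

def interleave_channels(user_stream, agent_stream):
    """Fuse the two synchronous channels into one sequence a decoder-only LM can model.

    At every frame the model observes what the user said in that 40 ms slice and
    must predict the agent's token for the same slice, so overlap and barge-in are
    representable: both channels can carry non-silence tokens at the same frame.
    """
    fused = []
    for u_tok, a_tok in zip(user_stream, agent_stream):
        fused.append(("USER", u_tok))
        fused.append(("AGENT", a_tok))
    return fused

# Example: the user barges in 1.2 s after the agent starts speaking.
user = clock_aligned_stream([Utterance(1.2, ["u1", "u2", "u3"])], total_s=2.0)
agent = clock_aligned_stream([Utterance(0.0, ["a%d" % i for i in range(40)])], total_s=2.0)
print(interleave_channels(user, agent)[:6])
```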

The research on Synchronous LLMs marks an important deepening in the development direction of conversational AI: moving from primarily pursuing the "content intelligence" of model-generated content (i.e., whether the answers are correct and comprehensive) further towards pursuing the "interaction intelligence" of the interaction process itself (i.e., whether the communication is natural and efficient). The core breakthrough of this work lies in attempting to endow LLMs with "time perception" and "synchronous processing" capabilities 11, which is considered a key step towards achieving truly natural and fluent human-computer dialogue. Current mainstream dialogue systems, even those based on powerful LLMs, mostly operate in a half-duplex "question-answer" mode, lacking the subtle, real-time synchronous dynamics found in human conversation.12 One fundamental reason for this is that LLMs themselves are trained and infer based on discrete, untimestamped text sequences, and they inherently lack the ability to process and respond to time-precisely related phenomena in dialogue (such as when the other party starts speaking, pauses, or might insert feedback or interruptions).11 Synchronous LLMs attempt to remedy this deficiency by designing specific mechanisms to integrate time information into the LLM's operation.11 Only when the model can perceive dynamic changes in the dialogue flow in real-time and adjust its listening and speaking behavior accordingly can true full-duplex natural interaction be achieved. If this technology matures and becomes widely adopted, it will undoubtedly greatly enhance the user experience of existing voice assistants (like Siri, Alexa, etc.) and may even spawn entirely new application scenarios, such as more natural collaborative robots, real-time simultaneous interpretation systems, or highly anthropomorphic virtual companions. However, this also brings new and extremely challenging technical problems: How to effectively represent and utilize time information within the complex network structure of LLMs? How to pursue real-time response without significantly sacrificing the powerful language understanding and reasoning capabilities of LLMs themselves? And, how can large-scale, high-quality dialogue data containing rich synchronous interaction phenomena be acquired or generated for model training? It is worth noting that another related work 19 proposed the NTPP (Next-Token-Pair Prediction) paradigm, which is also an exploration to enhance dialogue synchronicity. NTPP attempts to learn turn-taking behavior without explicit voice activity detection (VAD) by directly modeling the joint speech distribution of both dialogue parties. This shares a common philosophy with Synchronous LLMs' pursuit of more natural interaction, both pointing towards a deeper modeling of dialogue dynamics.

2.7 Mini-Omni: Language Models Supporting Streaming Hear-Talk Interaction (Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming) (arXiv: 2408.16725)

The core goal of Mini-Omni 13 is to develop an end-to-end, audio-based dialogue model capable of real-time, streaming voice interaction. This means the model can not only understand the user's voice input but also think while the user is speaking and generate voice output in a streaming manner, achieving a natural dialogue experience similar to human "hearing, talking while thinking." At the same time, this research aims to preserve as much as possible the powerful language understanding and reasoning capabilities of the underlying large language model while achieving this advanced interaction capability, and to eliminate the significant latency introduced by traditional voice dialogue systems' reliance on external text-to-speech (TTS) modules.

  • Model Architecture and Method:
    • End-to-End Design: Mini-Omni pursues a complete model chain from audio input directly to audio output, avoiding the traditional cascaded architecture of "ASR (Automatic Speech Recognition) + LLM (Language Model) + TTS (Text-to-Speech)." This cascaded architecture not only accumulates errors from each module but also easily introduces non-negligible latency due to module switching and data conversion, affecting the real-time nature of interaction.13
    • Text-instructed Speech Generation: This is one of the core methods Mini-Omni uses to produce voice output. Roughly, the LLM first generates internal text representations or instructions, and these then guide the model's internal speech generation module to produce the corresponding speech waveforms or acoustic features.13
    • Parallel Generation: To preserve the LLM's powerful text processing and reasoning capabilities while introducing voice output, and to reduce potential interference from the audio modality on text capabilities, Mini-Omni proposes a parallel generation paradigm. In this paradigm, the model's Transformer backbone is designed to simultaneously produce audio-related tokens and text-related tokens.14 This design might allow the text stream to act as an "internal thought" or "logical skeleton," while the audio stream performs real-time sound rendering based on it (a decoding-step sketch is given after this list).
    • Batch-parallel Strategies during Inference: During model inference (i.e., actual runtime), Mini-Omni further adopts batch-based parallel strategies to improve performance. This is particularly helpful in enhancing the model's real-time inference capability and response speed during streaming audio output.13
    • "Any Model Can Talk" Method: This is a training method proposed by Mini-Omni aimed at lowering the barrier for other researchers to add voice interaction capabilities to their own LLMs. Its core idea is that by making minimal modifications to the original LLM (e.g., by introducing additional adapter modules and utilizing pretrained acoustic models), and using a relatively small amount of high-quality speech data (possibly synthesized by other models like GPT-4o) for fine-tuning, existing LLMs can quickly be endowed with voice output capabilities. This method, combined with the aforementioned parallel modeling strategy, can achieve streaming output in the newly introduced voice modality while preserving the core capabilities of the original LLM as much as possible.13
  • Dataset (VoiceAssistant-400K): The research team found that existing open-source question-answering (QA) datasets have some shortcomings when used for training audio assistants (e.g., they may lack the naturalness of dialogue and the specific tone of voice assistants). To this end, they proposed a new dataset specifically designed to optimize voice model output, called VoiceAssistant-400K. This dataset reportedly contains about 400,000 samples, synthesized using advanced models like GPT-4o, and aims to enable the model to learn a voice quality and expression style closer to real voice assistants through fine-tuning.13
  • Streaming: Mini-Omni's entire model design aims to support streaming input and output, striving to achieve true real-time voice interaction, making users feel like they are conversing with a partner who can respond instantly.13
  • Experiments and Evaluation: The Mini-Omni paper mentions that they conducted a series of experiments to evaluate the model's capabilities in tasks such as audio input understanding, audio output generation, and automatic speech recognition (ASR). Additionally, they paid special attention to the degree of impact on the original LLM's language capabilities after introducing the audio modality, as well as the actual effects of various proposed inference methods (such as parallel generation, batch-parallel) and the differences between different variants.14
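The parallel generation idea referenced above can be pictured as one shared backbone step feeding two kinds of output heads. The sketch below is illustrative only: the vocabulary sizes, the number of audio codebooks, and the `backbone` callable are assumptions, not Mini-Omni's published configuration. At each step the model emits one text token, which acts as the logical anchor, and a set of audio codebook tokens, which can be streamed straight to a codec decoder.

```python
import torch
import torch.nn as nn

class ParallelHeads(nn.Module):
    """Sketch of a decoding step that emits one text token and several audio
    codebook tokens from the same hidden state (names and sizes illustrative)."""
    def __init__(self, hidden=4096, text_vocab=32000, audio_vocab=4096, n_codebooks=7):
        super().__init__()
        self.text_head = nn.Linear(hidden, text_vocab)
        self.audio_heads = nn.ModuleList(
            [nn.Linear(hidden, audio_vocab) for _ in range(n_codebooks)]
        )

    def forward(self, hidden_state):
        # hidden_state: (B, hidden) output of the shared Transformer backbone
        text_logits = self.text_head(hidden_state)
        audio_logits = [head(hidden_state) for head in self.audio_heads]
        return text_logits, audio_logits

@torch.no_grad()
def streaming_decode_step(backbone, heads, past_tokens):
    """One decoding step: the text token carries the "internal thought"; the
    audio tokens render it as sound and can be streamed to the codec decoder."""
    hidden = backbone(past_tokens)               # assumed pretrained LLM backbone (placeholder)
    text_logits, audio_logits = heads(hidden)
    text_token = text_logits.argmax(dim=-1)
    audio_tokens = [logits.argmax(dim=-1) for logits in audio_logits]
    return text_token, audio_tokens              # audio_tokens -> codec decoder -> waveform
```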

Mini-Omni's research design, particularly its proposed "parallel generation" strategy and "Any Model Can Talk" method, profoundly reflects a core trade-off faced in building practical, advanced voice dialogue systems in the future: how to maximize the preservation and utilization of the powerful core reasoning capabilities already possessed by its underlying large language model while pursuing the real-time and naturalness of end-to-end voice interaction.13 Performing complex reasoning and planning directly on the original audio modality is itself a highly challenging task, which can easily lead to incoherent or logically chaotic model outputs.14 At the same time, introducing a new modality (like voice input/output) and a new task (like real-time speech synthesis) to an already very large and complex LLM, if not handled properly, is very likely to interfere with or weaken the model's original, hard-won powerful understanding and reasoning capabilities in the text domain. The solutions proposed by Mini-Omni, such as having the model generate text and audio tokens in parallel, can be seen as an attempt to let the text stream (representing more abstract semantics and logic) "guide" or "anchor" the audio stream (representing more concrete acoustic implementation), thereby helping to maintain the coherence and logicality of the output content. The "Any Model Can Talk" method, by introducing adapter modules and fine-tuning with only a small amount of synthetic data, attempts to graft voice interaction capabilities onto existing LLMs in a "lightweight" and "low-intrusive" manner, avoiding the uncertainty and performance degradation risks that could arise from large-scale modifications to the LLM's main structure. This design philosophy—how to maximize the reuse and preservation of the core advantages of existing powerful models (like LLMs) while pursuing new functionalities (like real-time voice interaction)—has important guiding significance for building more complex and intelligent multimodal interactive systems in the future. This may further promote the development of modular model design, efficient adapter technology, and more sophisticated multi-task multimodal learning frameworks. Mini-Omni's open-source commitment 13 will also provide valuable code and model resources for the community's exploration in this frontier direction, accelerating research and iteration of related technologies.

III. Main Research Results and Contributions

This chapter will outline the main research results, key innovations, significance to relevant fields, identified limitations, and future research outlook for each of the seven core papers.

3.1 Llama 2: Open, High-Performance, and Responsible LLM (arXiv: 2307.09288)

  • Core Results: Llama 2 successfully released a series of pretrained language models (Llama 2) and their dialogue-optimized fine-tuned versions (Llama 2-Chat), with parameter scales ranging from 7 billion to 70 billion. In most public benchmark tests, Llama 2-Chat ranked among the top open-source chat models, and its performance in terms of helpfulness and safety, based on human evaluations, was considered comparable to some well-known closed-source commercial models.1
  • Innovation:
    • Openness of Methodology: One of Llama 2's most significant innovations is the detailed disclosure of its entire development process, including pretraining data composition and hyperparameters, supervised fine-tuning (SFT) data strategy, specific implementation details of reinforcement learning from human feedback (RLHF) (e.g., innovatively using separate helpfulness and safety reward models, and combining rejection sampling with the PPO algorithm for iterative optimization; a rejection-sampling sketch is given after this list), and a series of alignment techniques aimed at enhancing model safety (such as safety context distillation and extensive red teaming).1
    • Ghost Attention (GAtt) Mechanism: To improve the model's consistency in multi-turn dialogues and its ability to remember initial instructions, Llama 2 proposed and validated the effectiveness of the GAtt mechanism.1
  • Significance to the Field: The release of Llama 2 has had a profound impact on the entire LLM field. It not only provided academia and industry with a powerful, freely usable open-source base model but, more importantly, by disclosing its detailed training and alignment methods, it greatly promoted the development of the open-source LLM ecosystem. This has fostered transparency and democratization in AI safety and alignment research, enabling a broader range of researchers to participate in efforts to build more responsible and controllable LLMs.1
  • Limitations:
    • Language and Cultural Bias: Llama 2's training and testing were primarily focused on English, so its capabilities in other languages are relatively limited, and it may exhibit certain cultural biases.1
    • Content Risks: Despite extensive safety alignment work, Llama 2 may still generate inaccurate or untruthful information (hallucinations), or in some cases, produce harmful or biased content.1
    • Over-conservatism: Safety tuning can sometimes lead the model to behave overly cautiously or refuse to answer certain harmless prompts, affecting user experience.1
  • Future Outlook: The Llama 2 research team stated their commitment to continuously improving the Llama 2-Chat model, aiming to further enhance its usefulness, truthfulness, and safety, and potentially expand its capabilities in more languages and tasks.1
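For readers unfamiliar with rejection sampling as used in this kind of RLHF pipeline, the sketch below shows the basic loop in illustrative form (`policy.sample` and `reward_model.score` are placeholder interfaces, not Llama 2's code): sample several candidate responses per prompt, score them with the reward model, and keep only the best one as a new fine-tuning target.

```python
def rejection_sampling_step(policy, reward_model, prompt, k=8):
    """Sample K candidates, keep the highest-reward one as a new SFT target.
    Illustrative only; `policy.sample` and `reward_model.score` are placeholders."""
    candidates = [policy.sample(prompt) for _ in range(k)]
    scored = [(reward_model.score(prompt, c), c) for c in candidates]
    best_reward, best_response = max(scored, key=lambda x: x[0])
    return {"prompt": prompt, "response": best_response, "reward": best_reward}

# The selected (prompt, response) pairs then serve as gold demonstrations for a
# further round of supervised fine-tuning, before or alongside PPO optimization.
```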

3.2 FLAN: Instruction Fine-tuning Unlocks Zero-Shot Capabilities (arXiv: 2109.01652)

  • Core Results: FLAN's research strongly demonstrated that "instruction tuning" of large-scale pretrained language models is an effective method to significantly improve their zero-shot learning capabilities on entirely new tasks not seen during the model's training phase. Specifically, the instruction-tuned FLAN model (137 billion parameters) outperformed the zero-shot performance of the larger, non-instruction-tuned GPT-3 model (175 billion parameters) on 19 out of 25 evaluated NLP tasks. Even more impressively, on several challenging tasks including ANLI, RTE, and BoolQ, FLAN's zero-shot performance significantly surpassed GPT-3's few-shot performance.3
  • Innovation: FLAN's innovation lies in its systematic exploration and validation of the instruction tuning method. Researchers not only constructed a large-scale, diverse fine-tuning dataset comprising over 60 different NLP tasks described via natural language instructions (a template sketch is given after this list) but also revealed key factors influencing the success of instruction tuning through ablation studies. These factors include the number of tasks involved in fine-tuning (more is better), the scale of the base model (larger is better), and the natural language formulation of the instructions themselves (using natural language instructions is more effective than using only examples).3
  • Significance to the Field: FLAN's work provided a very effective and relatively efficient path for enhancing the generalization ability and usability of large language models. It showed that by teaching models "how to follow instructions," they can better utilize their existing pretrained knowledge to adapt to new tasks without requiring large amounts of labeled data for each new task. This discovery greatly spurred a subsequent wave of research and models based on instruction tuning and became a key technology for improving the universality and dialogue capabilities of modern LLMs (such as GPT-3.5, GPT-4, Llama 2-Chat, etc.).2
  • Limitations:
    • Task Type Dependence: The authors of FLAN noted that when the goal of the fine-tuning task is very similar to the original language model's pretraining objective (e.g., predicting the next word), the improvement brought by instruction tuning might be less significant.2
    • Instruction Complexity: The instructions used in FLAN were relatively simple, usually single-sentence descriptions.2 Their effectiveness for more complex, multi-step reasoning instructions remains to be validated.
    • Language and Alignment: FLAN's research primarily focused on English and did not deeply consider model safety alignment issues (such as avoiding harmful outputs) in its original work.2
  • Future Outlook: Based on FLAN's initial success, future research directions naturally include exploring more complex and natural instruction formats, extending instruction tuning to larger-scale models and a wider range of task types, and combining instruction tuning with more advanced alignment techniques like RLHF, with the aim of obtaining more powerful and controllable language models.2
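As an illustration of what "tasks described via natural language instructions" means in practice, the sketch below recasts a single NLI-style example into instruction format. The template wording is invented for illustration and is not copied from FLAN; the paper uses multiple hand-written templates per task in the same spirit.

```python
# Illustrative instruction templates for an NLI-style task, in the spirit of
# FLAN's per-task templates (wording here is made up, not taken from the paper).
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis? OPTIONS: yes, no, maybe",
    "Read the text: {premise}\nCan we conclude that \"{hypothesis}\"? OPTIONS: yes, no, maybe",
]

def to_instruction_example(premise, hypothesis, label, template_id=0):
    """Turn a raw labeled example into (instruction, target) text for fine-tuning."""
    prompt = NLI_TEMPLATES[template_id].format(premise=premise, hypothesis=hypothesis)
    return {"input": prompt, "target": label}

print(to_instruction_example(
    "A man is playing a guitar on stage.",
    "Someone is performing music.",
    "yes",
))
```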

3.3 InstructGPT: RLHF Aligns LLMs with Human Intent (arXiv: 2203.02155)

  • Core Results: InstructGPT, through its innovative three-stage alignment process (SFT → RM → RLHF-PPO), successfully enabled language models with relatively small parameter counts (e.g., 1.3 billion) to achieve output quality significantly preferred by human evaluators over that of the much larger (175 billion parameters) original GPT-3 model. Furthermore, the aligned InstructGPT models showed improvements in the truthfulness of generated content, a reduced tendency to produce harmful outputs, and minimal performance degradation on traditional tasks in public NLP dataset evaluations.5
  • Innovation: The core innovation of InstructGPT lies in its systematic application of Reinforcement Learning from Human Feedback (RLHF) to align general-purpose, large-scale language models, enabling them to better follow a wide range of written instructions given by users in natural language (the reward model's pairwise preference loss is sketched after this list). More importantly, its alignment goal explicitly targeted making model behavior more consistent with human expectations across three core dimensions: "helpful" (i.e., helps users solve problems), "honest" (i.e., does not fabricate information or mislead users), and "harmless" (i.e., does not produce discriminatory, violent, or other undesirable content).5
  • Significance to the Field: InstructGPT's work is of landmark significance for the development of LLMs. It not only validated the feasibility and effectiveness of RLHF as a powerful LLM alignment technique but also provided the core methodological foundation for a subsequent series of highly influential models (such as ChatGPT). This research clearly demonstrated that through a careful alignment process, even models with relatively small parameter scales can achieve, or even surpass, the performance of unaligned, larger-scale models in specific human preference dimensions. This pointed the way for how to build AI systems more aligned with human expectations under limited resources.5
  • Limitations:
    • Imperfection of Alignment: Although InstructGPT made significant progress in alignment, it is not perfectly aligned or absolutely safe. The model may still produce harmful, biased, or untruthful outputs in certain situations, or misunderstand, or even completely ignore, some user instructions.5
    • Limitations of Alignment Goals: The model's alignment results are largely influenced by the group of human labelers involved (their backgrounds, values, etc.) and the preferences set by researchers when designing labeling guidelines. This means InstructGPT's "alignment" is specific to certain populations and standards, rather than being universal.5
    • Potential Over-conservatism: Sometimes, to avoid generating inappropriate content, the model might behave overly cautiously or avoid certain topics, even if these topics are harmless, which could affect its usefulness.5
  • Future Outlook: The authors of InstructGPT also pointed out future directions worth exploring, including: how to further reduce the probability of models producing harmful outputs (e.g., through stronger adversarial training or data filtering); how to train models to appropriately refuse to execute improper or harmful user instructions; exploring more efficient and lower-cost human feedback collection mechanisms (e.g., allowing users to directly edit model outputs); and how to design more inclusive and representative alignment processes to reflect the values and expectations of a broader user base.5
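The reward model at the heart of the second stage is trained with a pairwise preference loss. The snippet below writes it out in the standard form reported in the InstructGPT paper, with dummy scores for illustration; the actual training additionally normalizes over the comparisons collected per prompt.

```python
import torch
import torch.nn.functional as F

def reward_model_pairwise_loss(r_chosen, r_rejected):
    """InstructGPT-style pairwise ranking loss:
    loss = -log(sigmoid(r_chosen - r_rejected)), averaged over comparison pairs.
    r_chosen / r_rejected are scalar reward scores for the preferred and
    dispreferred response to the same prompt."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example with dummy scores for a batch of three comparisons
r_w = torch.tensor([1.2, 0.3, 2.0])  # rewards of human-preferred responses
r_l = torch.tensor([0.4, 0.5, 1.1])  # rewards of rejected responses
print(reward_model_pairwise_loss(r_w, r_l))
```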

3.4 MiniGPT-4: Lightweight Alignment Achieves Advanced Multimodal Understanding (arXiv: 2304.10592)

  • Core Results: MiniGPT-4's research demonstrated that by merely using a trainable linear projection layer to align a parameter-fixed pretrained visual encoder (like ViT + Q-Former) with an advanced, similarly parameter-fixed instruction-tuned LLM (Vicuna), the integrated model could exhibit various impressive advanced multimodal capabilities. These capabilities include generating highly detailed and imaginative image descriptions, directly creating website code from user-drawn sketches, composing stories or poems inspired by given images, and providing cooking instructions and recipes based on food photos—abilities rarely seen in many previous VLMs.6
  • Innovation:
    • Computationally Efficient Multimodal Alignment Method: MiniGPT-4 proposed a very lightweight and computationally efficient multimodal alignment scheme. By freezing the main visual and language modules and training only a simple projection layer, it significantly reduced the computational cost and technical barrier for building powerful vision-language models (a minimal sketch of this setup is given after this list).9
    • Two-Stage Training Strategy to Address Unnatural Language: To solve the problem of unnatural language generation (e.g., repetition, fragmentation) that can arise from initial alignment using only large-scale image-text pairs with short descriptions, MiniGPT-4 innovatively adopted a two-stage training strategy. The first stage involved pretraining on large-scale image-text pairs to learn basic vision-language correspondences; the second stage used a smaller but extremely high-quality dataset containing detailed image descriptions for fine-tuning, thereby significantly improving the fluency, reliability, and overall usability of the language generated by the model.6
  • Significance to the Field: MiniGPT-4's work provided an effective paradigm for academia and industry to quickly build and explore VLM capabilities. Its concise architecture and efficient training method enabled more researchers to participate in this cutting-edge field, inspiring the development and emergence of many subsequent open-source large vision-language models.6 It revealed the powerful potential of advanced LLMs, showing that their general capabilities can be transferred to new modalities through simple alignment.
  • Limitations:
    • Insufficient Quantitative Evaluation: The MiniGPT-4 paper primarily relied on qualitative examples to demonstrate its capabilities, lacking comprehensive quantitative comparisons with the latest SOTA models on standard VLM benchmarks, which makes objective assessment of its performance somewhat difficult.6
    • Fine-grained Recognition Capability: Since LLMs themselves may not be optimized for fine-grained visual recognition tasks, and the alignment of visual features to LLMs might lose some details, MiniGPT-4 could perform weakly on tasks requiring precise identification of minute objects or subtle differences in images.6
    • Accuracy of Claims: Reviewers noted that some claims in the paper (e.g., regarding the lack of multimodal instruction tuning datasets, or the implication that Q-Former's presence or absence had little impact on results) might be misleading and require more rigorous phrasing.6
  • Future Outlook: Future work could focus on improving the model's performance on fine-grained visual recognition tasks, for example, by exploring how to better preserve and utilize detail information in visual features while maintaining the model's strong cognitive and generative abilities. Achieving a balance between tasks requiring comprehensive cognitive abilities and those needing fine-grained recognition is an important research direction.6
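A minimal sketch of the alignment setup described above follows, assuming illustrative dimensions and placeholder names for the frozen components (`frozen_vit_qformer`, `frozen_llm.caption_loss`): only the single linear projection is trainable, and the same objective is used in both training stages, with only the data changing.

```python
import torch
import torch.nn as nn

class MiniGPT4StyleProjector(nn.Module):
    """Only this linear layer is trained; the visual encoder and the LLM stay frozen."""
    def __init__(self, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, qformer_tokens):
        # qformer_tokens: (B, 32, qformer_dim) query outputs from the frozen Q-Former
        return self.proj(qformer_tokens)  # (B, 32, llm_dim) soft prompts for the frozen LLM

def stage_training_loss(frozen_vit_qformer, projector, frozen_llm, images, text):
    """Same objective in both stages; only the data changes: stage 1 uses large
    noisy image-caption pairs, stage 2 a small set of detailed, high-quality
    descriptions (the frozen components here are placeholder callables)."""
    with torch.no_grad():
        visual_tokens = frozen_vit_qformer(images)
    prompts = projector(visual_tokens)
    return frozen_llm.caption_loss(prompts, text)
```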

3.5 Video-LLaMA: Exploration of Unified Audio-Visual Understanding (arXiv: 2306.02858)

  • Core Results: Video-LLaMA proposed a multimodal framework aimed at endowing Large Language Models (LLMs) with the ability to simultaneously understand dynamic visual content and synchronous auditory events in videos, and to engage in dialogue with users based on this comprehensive audio-visual understanding.16
  • Innovation:
    • Multi-branch Audio-Visual Processing Architecture: Video-LLaMA designed a multi-branch architecture containing independent visual Q-Formers and audio Q-Formers. The visual branch processes video frame sequences and captures their temporal dynamics, while the audio branch handles the audio stream. Information from these two modalities is processed separately and then fed into the LLM for fused understanding.16
    • Clever Utilization of ImageBind for Audio Alignment: Facing the scarcity of high-quality audio-text paired data, Video-LLaMA innovatively leveraged the characteristics of ImageBind, a universal multimodal encoder. ImageBind can map information from different modalities (including images and audio) into the same shared embedding space. Video-LLaMA's audio branch, by training on visual-text data, indirectly learned the correspondence between audio and language, thereby achieving impressive zero-shot audio understanding capabilities without direct large-scale audio-text data training.16
  • Significance to the Field: Video-LLaMA's work represents an important step towards building AI assistants capable of comprehensively understanding real-world dynamic scenes (which are inherently combinations of audio and video). It provides valuable exploration and practice on how to enable LLMs to process and integrate time-varying visual information and synchronous audio information from videos, which is inspiring for subsequent audio-visual multimodal large model research.16
  • Limitations:
    • Indirectness of Audio Understanding: Since the training of the audio branch primarily relies on ImageBind's pre-alignment capability and the indirect transfer from visual-text data, the depth, accuracy, and robustness of its audio understanding may need further validation and improvement through direct, large-scale audio-text data.
    • Evaluation Method: Similar to many early multimodal models, Video-LLaMA's performance evaluation primarily relied on qualitative case demonstrations, lacking comprehensive quantitative assessment on standardized, large-scale audio-visual understanding benchmarks.
  • Future Outlook: Future research directions could include further enhancing the model's capabilities in audio-visual collaborative understanding, for example, by exploring more direct and effective audio-visual-language joint training methods, rather than relying solely on indirect alignment. Additionally, developing and applying more comprehensive, standardized quantitative evaluation benchmarks is crucial for measuring and comparing the performance of such complex multimodal systems.16

3.6 Synchronous LLMs: Towards Full-Duplex Natural Dialogue (arXiv: 2409.15594)

  • Core Results: The research "Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents" proposed the concept and implementation of Synchronous LLMs. By cleverly integrating time information into Large Language Models like Llama3-8b, enabling them to run synchronously with real-world physical clocks, it achieved, for the first time, truly full-duplex spoken dialogue. Experimental results showed that Synchronous LLMs outperformed existing SOTA technologies in terms of dialogue meaningfulness and relevance while maintaining naturalness.10
  • Innovation:
    • Introduction of Time Synchronization Concept: The core innovation of this work is the first attempt to introduce the key concept of "time synchronization" into LLM design to address the fundamental synchronicity challenges faced by traditional LLMs in modeling full-duplex dialogue (such as handling speech overlap, user interruptions, and instant feedback).11
    • Efficient Training Scheme: Researchers proposed a novel training scheme that combined a large amount of synthetic speech dialogue data generated from text dialogue data (approximately 212,000 hours) with a relatively small amount of real-world speech dialogue data (approximately 2,000 hours) to effectively train models capable of full-duplex interaction.11
    • Duplex S2S Architecture (Related Work): Another closely related paper 10 proposed a novel duplex speech-to-speech (S2S) architecture characterized by its ability to handle continuous user speech input and directly model simultaneous speech streams from both user and AI agent through channel fusion technology. This architecture reportedly does not require a specialized speech pretraining process.
  • Significance to the Field: The research on Synchronous LLMs provides new ideas and technical paths for addressing the core pain points of current voice dialogue systems, which generally suffer from unnatural interaction and lack of real-time responsiveness. If this technology matures and is widely adopted, it is expected to greatly enhance the user experience of human-computer voice interaction, making conversations with AI closer to natural human-to-human communication.12
  • Limitations:
    • Mechanism Details and Universality: The specific details of how time information is integrated into the internal workings of LLMs, and whether this mechanism can be broadly applied to LLMs of different architectures, may require further clarification and validation.
    • Network Latency Robustness: Although the paper mentions testing under simulated network latency (up to 240ms), the model's robustness under broader and more complex real-world network conditions still needs further comprehensive examination.
  • Future Outlook: Future research could focus on further optimizing the time integration mechanism and improving the model's full-duplex interaction performance in more complex real-world dialogue scenarios (e.g., multi-party conversations, environments with strong background noise). Additionally, exploring more efficient and lower-cost methods for acquiring and utilizing full-duplex dialogue data (especially real data containing rich synchronous phenomena) will also be an important direction for advancing this field.

3.7 Mini-Omni: Implementation of End-to-End Streaming Audio Dialogue (arXiv: 2408.16725)

  • Core Results: The Mini-Omni project successfully proposed and implemented what is claimed to be the first open-source, end-to-end multimodal large language model. The core feature of this model is its ability to understand audio input and generate streaming audio output, thereby supporting real-time, natural voice interaction. Importantly, Mini-Omni aims to achieve these advanced interactive functions while preserving the powerful language understanding and reasoning capabilities of its underlying LLM.13
  • Innovation:
    • Text-Instructed Parallel Speech Generation: To ensure the real-time nature of voice interaction without sacrificing the LLM's reasoning quality, Mini-Omni proposed a text-instructed speech generation method combined with a parallel generation strategy. Under this strategy, the model might be designed to simultaneously generate internal text representations (for logical control and content planning) and external audio tokens (for sound implementation), attempting to balance interaction fluency and content accuracy.14
    • "Any Model Can Talk" Method: This is an innovative method aimed at lowering the barrier for adding voice interaction capabilities to existing LLMs. Its core idea is that by introducing lightweight adapter modules and fine-tuning with a small amount of (possibly synthetic) high-quality voice data, other pretrained LLMs can quickly acquire voice input/output and real-time interaction capabilities without large-scale structural modifications to the original model.14
    • Dedicated Dataset VoiceAssistant-400K: Addressing potential shortcomings of existing open-source QA datasets for training voice assistants (such as lack of specific tone and naturalness), the Mini-Omni team built and released a new dataset called VoiceAssistant-400K, specifically designed to optimize the timbre and style of the model's output voice, making it closer to professional voice assistants.14
  • Significance to the Field: The open-source release of Mini-Omni provides the academic and open-source communities with a relatively complete and accessible solution and valuable code, model, and data resources for building real-time end-to-end voice dialogue systems. This will undoubtedly help promote research progress in this cutting-edge field and the application and popularization of related technologies, allowing more researchers to participate in exploring more natural and intelligent human-computer voice interaction.13
  • Limitations:
    • "Work in Progress" Status: As stated in its paper, Mini-Omni was still marked as "work in progress" at the time of release 15, meaning its various performance indicators, model stability, and robustness might still be under continuous development and optimization, not yet having reached a final mature state.
    • Complexity of Audio Reasoning: Directly performing complex, high-level semantic understanding and logical reasoning on the original audio modality is itself an extremely challenging task, potentially leading to incoherence or logical errors more easily than in the text modality.14
  • Future Outlook: Future work may focus on further improving the coherence and accuracy of the model in direct audio reasoning, optimizing its overall performance (such as reducing latency, improving voice quality, enhancing robustness to noisy environments), and exploring more efficient and stable streaming processing and parallel generation mechanisms, with the aim of achieving a real-time voice interaction experience closer to human level.

IV. Review and Discussion

4.1 Comprehensive Review: Innovation, Advantages, and Limitations

The seven papers covered in this survey outline the vigorous development trend in the current field of Large Language Models (LLMs) from different dimensions. We can comprehensively review their innovation, advantages, and inherent limitations from three aspects: the construction and alignment of foundational models, the expansion of multimodal capabilities, and the pursuit of real-time and natural interaction.

  • Evolution of LLM Foundational Models and Alignment Techniques:
    • Advantages and Innovation: Research in this direction has laid the cornerstone of modern LLMs. From Llama 2's 1 commitment to building high-performance open-source foundational models and exhaustively disclosing its training and alignment methodologies, to FLAN's 3 pioneering exploration of instruction fine-tuning in unlocking zero-shot learning capabilities, and InstructGPT's 5 success in aligning model behavior with human intent through RLHF, we clearly see that the improvement of LLM's foundational capabilities and the maturation of its behavioral controllability are complementary and indispensable. Llama 2's use of separate helpfulness and safety reward models in RLHF, combined with rejection sampling and PPO for optimization 1, and InstructGPT's classic three-step alignment process (SFT -> RM -> RLHF-PPO) 5, are important innovations in the development of alignment technology.
    • Limitations: Despite tremendous progress, the alignment process of LLMs still heavily relies on large-scale human-annotated data (whether it's demonstration data for SFT or preference data for RLHF), which is not only costly and time-consuming, but the annotation process itself may also introduce biases from the annotator group, leading to "aligned" results that are not universal or absolutely fair.1 Furthermore, the "alignment tax" issue—where alignment operations performed to make the model meet expectations in certain aspects (like safety) may, to some extent, impair the model's original performance or general capabilities on other tasks 5—remains a problem that researchers need to address and try to mitigate. Safety alignment itself is also an ongoing, dynamic challenge; although red teaming and continuous iterative optimization can identify and fix many problems, it is impossible to completely cover all potential risk points due to the infinite complexity of real-world scenarios.1
  • Expansion of Multimodal Capabilities:
    • Advantages and Innovation: Expanding LLM capabilities from the pure text domain to understanding and processing multimodal information such as images, audio, and video is a key step in enhancing AI systems' ability to interact with the physical world. The lightweight alignment method proposed by MiniGPT-4 6 (freezing main modules, only training a projection layer) provides an efficient idea for rapidly building and iterating vision-language models.

References

  1. [2307.09288] Llama 2: Open Foundation and Fine-Tuned Chat Models, accessed June 4, 2025, https://ar5iv.labs.arxiv.org/html/2307.09288
  2. Finetuned Language Models Are Zero-Shot Learners, accessed June 4, 2025, https://www.cs.toronto.edu/~cmaddis/courses/csc2541_w25/presentations/altintas_zeroshotlearners.pdf
  3. Finetuned Language Models Are Zero-Shot Learners - ResearchGate, accessed June 4, 2025, https://www.researchgate.net/publication/354379338_Finetuned_Language_Models_Are_Zero-Shot_Learners
  4. [2109.01652] Finetuned Language Models Are Zero-Shot Learners - arXiv, accessed June 4, 2025, https://arxiv.org/abs/2109.01652
  5. [2203.02155] Training language models to follow instructions with ..., accessed June 4, 2025, https://ar5iv.labs.arxiv.org/html/2203.02155
  6. MiniGPT-4: Enhancing Vision-Language Understanding with ..., accessed June 4, 2025, https://openreview.net/forum?id=1tZbq88f27
  7. Minigpt-4 Enhancing Vision-Language Understanding With Advanced Large Language Models | PDF | Visual Perception - Scribd, accessed June 4, 2025, https://www.scribd.com/document/705435168/Minigpt-4-Enhancing-Vision-language-Understanding-With-Advanced-Large-Language-Models
  8. [2304.10592] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models - arXiv, accessed June 4, 2025, https://arxiv.org/abs/2304.10592
  9. huggingface.co, accessed June 4, 2025, https://huggingface.co/spaces/Vision-CAIR/minigpt4/resolve/6aee8cdef96756f4c96de16b61e724fc1c6c2fee/MiniGPT_4.pdf
  10. Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model - arXiv, accessed June 4, 2025, https://arxiv.org/html/2505.15670v1
  11. Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents - arXiv, accessed June 4, 2025, https://arxiv.org/abs/2409.15594
  12. arxiv.org, accessed June 4, 2025, https://arxiv.org/html/2409.15594v1
  13. Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming - ResearchGate, accessed June 4, 2025, https://www.researchgate.net/publication/383530375_Mini-Omni_Language_Models_Can_Hear_Talk_While_Thinking_in_Streaming
  14. arxiv.org, accessed June 4, 2025, https://arxiv.org/html/2408.16725v3
  15. Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming - arXiv, accessed June 4, 2025, https://arxiv.org/abs/2408.16725
  16. [2306.02858] Video-LLaMA An Instruction-tuned Audio-Visual ..., accessed June 4, 2025, https://ar5iv.labs.arxiv.org/html/2306.02858
  17. [PDF] LMentry: A Language Model Benchmark of Elementary ..., accessed June 4, 2025, https://www.semanticscholar.org/paper/6ae3e52ae55578c10722db3c2f898442f20e336c
  18. OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts - arXiv, accessed June 4, 2025, https://arxiv.org/html/2503.22952v1
  19. NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction - arXiv, accessed June 4, 2025, https://www.arxiv.org/pdf/2506.00975
  20. [Literature Review] Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities, accessed June 4, 2025, https://www.themoonlight.io/en/review/full-duplex-bench-a-benchmark-to-evaluate-full-duplex-spoken-dialogue-models-on-turn-taking-capabilities
  21. Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows - arXiv, accessed June 4, 2025, https://arxiv.org/html/2505.24189v1
  22. Pixel Understanding with Visual Instruction Tuning, accessed June 4, 2025, https://cvpr.thecvf.com/media/cvpr-2024/Slides/31799.pdf