Research Report: Frontier Progress of Large Language Models—From Open Source Foundations to Multimodal Applications
Abstract
Large Language Models (LLMs) have made breakthrough progress in recent years and are rapidly expanding into the multimodal domain. This report, based on seven representative academic papers, conducts an in-depth investigation and analysis of the open-source training of LLMs, key fine-tuning technologies (Supervised Fine-Tuning SFT and Reinforcement Learning from Human Feedback RLHF), and their applications in multimodal scenarios including vision, audio, and video. The report first explores the construction and training paradigm of open-source LLMs represented by Llama 2. It then details the important roles of SFT and RLHF in enhancing models' instruction-following capabilities and alignment with human intent. Building on this, the report further examines how LLMs extend to vision-language understanding through models like MiniGPT-4, revolutionize the real-time and synchronous nature of audio interaction via models like SyncLLM and Mini-Omni, and tackle complex video content understanding challenges with models like Video-LLaMA. Finally, this report synthesizes the intrinsic connections and synergistic potential of these technologies, summarizes current common challenges, and provides an outlook on future research directions and technological breakthroughs, aiming to offer valuable references for researchers and practitioners in related fields.
Keywords: Large Language Models (LLMs), Open Source, Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), Multimodal, Vision-Language Models, Audio Dialogue, Video Understanding
Introduction
Large Language Models (LLMs) have undergone rapid development in the past few years. Their remarkable abilities in natural language understanding and generation have made them a research hotspot and technological frontier in the field of artificial intelligence. From early models based on the Transformer architecture to today's giant models with hundreds of billions of parameters, LLMs continue to refresh our understanding of their potential. A significant recent trend is the accelerated evolution of LLMs from purely text processing to multimodal information processing. This means that models must not only understand text but also be able to understand and generate various types of data, including images, audio, and video. For example, the highly acclaimed GPT-4 has already demonstrated extraordinary multimodal capabilities, such as directly generating websites from handwritten text and identifying humorous elements in images.1 The development of such multimodal LLMs heralds a profound transformation in human-computer interaction and is expected to create new application paradigms in various fields such as content creation, information retrieval, education, and healthcare.
This report aims to systematically review and analyze seven key academic papers recently published on the preprint platform arXiv. These papers represent the latest advancements in LLMs, from basic model training to advanced multimodal applications. Through an in-depth analysis of this cutting-edge research, the report strives to comprehensively present the core trajectory and developmental dynamics of current LLM technology. The report's structure will follow a logical sequence from foundational to applied, and from unimodal to multimodal. First, it will discuss the construction of open-source LLMs and their importance. Second, it will delve into two key fine-tuning techniques—Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)—and how they improve models' instruction-following capabilities and alignment with human intent. Subsequently, it will examine the extension and application of LLMs in three main multimodal directions: vision, audio (including synchronous dialogue and streaming interaction), and video understanding. Finally, the report will synthesize these technologies, summarize common challenges, and offer an outlook on future research directions.
To help readers quickly understand the core literature on which this report is based and its positioning within the report's structure, the following table provides an overview of these seven papers:
Table 1: Overview of Seven Core Research Papers
Paper Chinese Title (Translated) | Paper English Title | arXiv Identifier | Main Focus/Modality | Core Contribution Summary |
---|---|---|---|---|
Llama 2: Open Source Foundation and Fine-Tuned Chat Models | Llama 2: Open Foundation and Fine-Tuned Chat Models | 2307.09288 | Open-Source LLM, Text | Developed and released a series of pre-trained and fine-tuned LLMs (Llama 2-Chat) with parameter scales from 7B to 70B, optimized for dialogue scenarios, outperforming most open-source chat models.2 |
Fine-tuned Language Models are Zero-Shot Learners | FINETUNED LANGUAGE MODELS ARE ZERO-SHOT LEARNERS | 2109.01652 | SFT, Text | Proposed instruction tuning (FLAN), significantly improving LLM's zero-shot learning ability on unseen tasks, proving that generalization can be effectively achieved through instruction fine-tuning.4 |
Training Language Models to Follow Instructions with Human Feedback | Training language models to follow instructions with human feedback | 2203.02155 | RLHF, Text | Proposed InstructGPT, enabling LLMs to better follow user intent through RLHF, producing more truthful, useful, and harmless content, gaining user preference even with fewer model parameters.6 |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | MINIGPT-4: ENHANCING VISION-LANGUAGE UNDERSTANDING WITH ADVANCED LARGE LANGUAGE MODELS | 2304.10592 | Vision-Language Model (VLLM) | Proposed MiniGPT-4, which connects a pre-trained visual encoder and an advanced LLM (Vicuna) via a simple projection layer, achieving various advanced multimodal capabilities similar to GPT-4.1 |
Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents | Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents | 2409.15594 | Audio Dialogue, Time Synchronization | Proposed SyncLLM, which achieves full-duplex spoken dialogue synchronized with a real clock by integrating temporal information into LLMs, enhancing dialogue naturalness and meaningfulness.8 |
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | 2408.16725 | Audio Dialogue, Streaming Processing | Proposed Mini-Omni, the first open-source end-to-end real-time speech interaction model, achieving "listen, talk, and think while streaming" through techniques like text-instructed parallel generation and batch parallel decoding.10 |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Video-LLaMA An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2306.02858 | Video Understanding, Audio-Visual Fusion | Proposed Video-LLaMA, a multi-branch cross-modal framework enabling LLMs to simultaneously understand visual and auditory content in videos and conduct instruction-driven dialogues.12 |
Chapter 1: The Cornerstone and Training Paradigm of Open-Source Large Language Models
The research and development of Large Language Models (LLMs) were once dominated by a few institutions with substantial capital and large-scale computing resources. However, with technological advancements and community demand, open-source LLMs have gradually become an undeniable force, injecting new vitality into academic research and industrial innovation. This chapter will use Llama 2 as an example to explore the significance, core architecture, and training paradigm of open-source LLMs.
- Core Paper: Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv:2307.09288)
- 1.1 Significance and Core Architecture of Llama 2
The release of Llama 2 is widely regarded as a significant milestone in the development of open-source LLMs.2 It provides a series of pre-trained models with parameter scales ranging from 7 billion to 70 billion, as well as fine-tuned models optimized for chat (Llama 2-Chat). This openness greatly lowers the barrier for researchers and developers to access and use advanced LLMs. A broader community can conduct secondary development, experiment with new ideas, and customize models for specific application scenarios. This not only accelerates technological iteration and innovation—for example, many subsequent multimodal model studies may be built upon open-source models like Llama—but also promotes joint discussion on LLM safety and ethical issues, and responsible development practices.3 Therefore, open source has become a key driving force for democratizing LLM technology, fostering a thriving ecosystem, and guiding technology towards beneficial development.
Although the specific architectural details of Llama 2 are not elaborated in its abstract, as a series of foundational models it provides a solid basis for subsequent fine-tuning and optimization for specific scenarios like dialogue interaction. These models are based on the Transformer architecture and learn deep language structures and world knowledge by training on large-scale, diverse text corpora.
- 1.2 Detailed Pre-training and Fine-tuning Process
The development of Llama 2 follows the mainstream "pre-train then fine-tune" paradigm in the current LLM field. This paradigm is a core path for building high-performance LLMs because it effectively balances the model's general knowledge acquisition with its adaptability to specific tasks.
- Pre-training: The Llama 2 series first includes a set of pre-trained models. In the pre-training phase, the model is trained on massive amounts of text data with the goal of learning general language representations, grammatical structures, semantic relationships, and broad world knowledge. This process typically employs self-supervised learning, such as predicting the next token in a text sequence (a minimal sketch of this objective follows the list below). In this way, the model internalizes the statistical patterns and concepts of language.
- Fine-tuning: After acquiring powerful general capabilities through pre-training, the model is fine-tuned for specific downstream tasks or application scenarios. Llama 2-Chat is the product of fine-tuning Llama 2 for dialogue use cases.2 The datasets used in the fine-tuning phase are usually more targeted, such as dialogue data, instruction data, etc. Through fine-tuning, the model can better understand the input format and expected output behavior of specific tasks. The Llama 2 paper details its fine-tuning methods, particularly the improvements made in Llama 2-Chat to enhance safety and helpfulness, aiming to empower the community to build upon this work and contribute to the responsible development of LLMs.3
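To make the self-supervised objective concrete, the snippet below is a minimal, illustrative sketch of next-token prediction with a cross-entropy loss. The tiny vocabulary, tensor shapes, and the stand-in "model" (an embedding plus a linear head) are assumptions chosen for brevity, not details from the Llama 2 paper.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the next-token prediction objective used in pre-training.
# Shapes and the tiny vocabulary are invented for clarity; real models use
# vocabularies of tens of thousands of tokens and very long sequences.
vocab_size, seq_len, batch_size, hidden = 100, 8, 2, 16

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # input text as token ids

# Stand-in for a Transformer decoder: embeddings + a linear head.
embed = torch.nn.Embedding(vocab_size, hidden)
lm_head = torch.nn.Linear(hidden, vocab_size)

hidden_states = embed(token_ids)           # (batch, seq, hidden)
logits = lm_head(hidden_states)            # (batch, seq, vocab)

# Each position predicts the *next* token, so shift logits and labels by one.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
loss.backward()  # gradients flow into the embedding and head parameters
```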
This two-stage training process—first building foundational capabilities through pre-training on large-scale unlabeled data, then adapting to specific applications through fine-tuning on small-scale labeled or specially formatted data—has become standard practice in current LLM development and application. Technologies discussed in subsequent chapters, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), are specific refinements and advanced applications within this broader paradigm.
- 1.3 Optimization Strategies and Effects for Dialogue Scenarios
To enable LLMs to better serve human interaction needs, dialogue capability is a core optimization direction. Llama 2-Chat is an optimized version focusing on dialogue application scenarios.
- Optimization of Llama 2-Chat: Llama 2-Chat employs specific fine-tuning strategies aimed at improving the model's helpfulness and safety in dialogue interactions.3 This means the model must not only understand user intent and provide useful information but also avoid generating harmful, biased, or inappropriate statements. This typically involves fine-tuning with high-quality dialogue data and may incorporate advanced techniques like RLHF to align the model's behavior with human expectations.
- Performance: According to its release information, Llama 2-Chat outperformed other open-source chat models available at the time on most benchmarks. More importantly, human evaluations show that Llama 2-Chat's helpfulness and safety make it a potentially suitable substitute for some closed-source models.3 This is a positive signal for users with limited budgets or higher requirements for model controllability.
The emergence of open-source LLMs (like Llama 2) and their continuous optimization for key applications like dialogue not only provide valuable research platforms for academia but also bring more choices and innovation opportunities to industry. They demonstrate the immense potential of the open-source community in advancing LLM technology and lay the foundation for building a more open, collaborative, and responsible AI ecosystem.
Chapter 2: Key Fine-tuning Technologies for Enhancing LLM Instruction Following and Generalization Capabilities
Although pre-trained Large Language Models possess a wealth of knowledge, they do not inherently understand and follow human instructions perfectly in various forms, nor do they necessarily perform excellently on unseen tasks.5 To bridge this gap, researchers have developed various fine-tuning techniques. This chapter will focus on two key fine-tuning methods: Supervised Fine-Tuning (SFT), particularly its application in instruction tuning, and Reinforcement Learning from Human Feedback (RLHF).
- 2.1 Supervised Fine-tuning (SFT) and its Empowerment of Zero-Shot Learning
- Core Paper: FINETUNED LANGUAGE MODELS ARE ZERO-SHOT LEARNERS (arXiv:2109.01652)
Supervised Fine-Tuning (SFT) is a method of further training a pre-trained model on labeled data to adapt it to specific tasks. In the LLM domain, a particularly effective form of SFT is "Instruction Tuning." The core idea of instruction tuning is that by exposing a pre-trained language model to a large number of tasks described in natural language instructions, together with their desired outputs, the model learns to understand the intent of these instructions and to generate responses that satisfy them.5 The central assumption of this method is that many different NLP tasks can be uniformly expressed as responses to some instruction.

The FLAN (Finetuned Language Net) model proposed in this paper is strong evidence of the effectiveness of instruction tuning. FLAN is a 137-billion-parameter pre-trained language model that the researchers instruction-tuned on a collection of over 60 NLP tasks, each described using natural language instruction templates; for example, a sentiment classification task might be phrased as "Is the sentiment of this movie review positive or negative?".4 (A minimal sketch of this data formatting follows the list of success factors below.)

The core finding of the FLAN research is a significant improvement in zero-shot learning, i.e., the model's ability to perform a task without having seen any training samples for that specific task. FLAN outperformed the zero-shot performance of the larger 175-billion-parameter GPT-3 on 20 of the 25 evaluated task types that were not seen during instruction tuning. Even more impressively, on several tasks including ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze, FLAN's zero-shot performance surpassed GPT-3's few-shot performance.4 This result strongly indicates that instruction tuning can unlock the inherent generalization potential of LLMs, enabling them to transfer patterns learned from known instruction tasks to entirely new, unseen instructions and task types. This is not merely adaptation to specific tasks but a fundamental reshaping of the LLM's interaction paradigm, making it more responsive to and better able to interpret instructions.

The research further revealed the key factors for the success of instruction tuning 5:
- Number and Diversity of Fine-tuning Datasets: The more and more diverse task clusters and datasets included in instruction tuning, the better the model's average performance on unseen tasks. This suggests that the model learns more general instruction understanding and execution capabilities from diverse instructions.
- Model Scale: The benefits of instruction tuning appear to have a non-linear relationship with model scale. For models reaching a certain scale (e.g., 100 billion parameters), instruction tuning can significantly improve performance. However, for smaller models (e.g., 8 billion parameters and below), instruction tuning might actually impair their performance on unseen tasks. This may imply that the model's "capacity" is important for learning and generalizing a large number of different instructions; smaller models might exhaust their capacity when learning numerous instructions, leading to overfitting or forgetting some pre-trained knowledge, while larger models have sufficient capacity to learn to follow instructions while maintaining their generalization ability.
- Importance of Natural Language Instructions: Using explicit, human-understandable natural language instructions during training is crucial. If only input/output pairs are used, or if only dataset names are added as prompts before the input, the effect is far inferior to models fine-tuned with complete instructions. This emphasizes the core role of the instruction's phrasing in the model's learning of how to "follow commands."
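The sketch below makes the data-formatting step concrete: a conventional labeled example is rewritten into a natural-language instruction and a target answer, which is the kind of pair instruction tuning trains on with an ordinary next-token loss. The template wording and field names are illustrative assumptions, not FLAN's actual templates.

```python
# Illustrative (not FLAN's actual templates): turning a labeled example into an
# instruction-tuning pair. The fine-tuning loss is then ordinary next-token
# cross-entropy on the target, conditioned on the instruction.

def to_instruction_example(review: str, label: str) -> dict:
    """Wrap a sentiment-classification example in a natural-language instruction."""
    instruction = (
        "Is the sentiment of this movie review positive or negative?\n\n"
        f"Review: {review}"
    )
    return {"input": instruction, "target": label}

example = to_instruction_example(
    review="A moving story with wonderful performances.",
    label="positive",
)
print(example["input"])
print("->", example["target"])

# Mixing many such templated tasks (translation, NLI, QA, ...) into one
# fine-tuning corpus is what lets the model generalize to unseen instructions.
```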
- 2.2 Reinforcement Learning from Human Feedback (RLHF) and its Application in Instruction Alignment
- Core Paper: Training language models to follow instructions with human feedback (arXiv:2203.02155)
Although SFT can teach models to follow instructions, merely following instructions is not equivalent to generating "good" output, because the standard of "good" often involves subjectivity, safety, truthfulness, usefulness, and other complex human values that are difficult to capture with simple rules.6 Large language models, even with massive parameter counts, can generate outputs that are untruthful (fabricating facts), toxic (containing offensive or discriminatory content), or simply unhelpful to the user. Reinforcement Learning from Human Feedback (RLHF) is designed to address this issue by better aligning LLM behavior with user intent and expectations.6 Its core goal is to make models "helpful" (assisting users in completing tasks), "honest" (not fabricating information or misleading users), and "harmless" (not causing physical, psychological, or social harm to people or the environment).

The InstructGPT model proposed in this paper demonstrates the effectiveness of RLHF. Trained via RLHF, InstructGPT models with far fewer parameters than GPT-3 (e.g., 1.3 billion versus 175 billion) produced outputs that human labelers preferred.6 This indicates that effective fine-tuning can, to some extent, compensate for differences in model scale, and that even medium-sized models can benefit significantly from RLHF.

RLHF typically employs a three-stage method 6 (a schematic of the reward-model loss and the KL-penalized objective follows this list):
- Collect Demonstration Data and Train a Supervised Policy (SFT): First, human labelers write high-quality demonstration answers to a series of prompts. These prompts can be written by the labelers themselves or sourced from actual users (e.g., submitted via an API). Then, using these "prompt-answer" pairs, a pre-trained LLM (like GPT-3) is initially fine-tuned via supervised learning. This SFT model provides a good starting point for the subsequent RLHF process.
- Collect Comparison Data and Train a Reward Model (RM): Next, for the same set of prompts, the SFT model (or other models) generates multiple different answers (typically 4 to 9). Human labelers then rank these answers, indicating which are better and which are worse. This ranking data is used to train a Reward Model (RM). The RM's input is a prompt and one of the model's answers, and its output is a scalar reward value reflecting the human preference for that answer. The RM's goal is to learn to predict how humans would evaluate different model outputs.
- Optimize a Policy Against the Reward Model Using PPO (Reinforcement Learning): Finally, the trained reward model is used as the reward function in a reinforcement learning environment. The model obtained from the SFT stage serves as the initial policy, and reinforcement learning algorithms like Proximal Policy Optimization (PPO) are used to further fine-tune this policy model. The optimization goal is to maximize the cumulative reward obtained from the RM, i.e., to make the policy model generate answers that receive higher scores from the RM (and thus are more aligned with human preferences). To prevent the policy model from over-optimizing the RM and deviating too far from the original language model's distribution (which could lead to unnatural or repetitive content), a penalty term, such as the KL divergence from the SFT model's output, is usually introduced.
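The following sketch illustrates the two quantitative pieces at the heart of stages two and three: the pairwise ranking loss used to train the reward model, and the KL-penalized reward that PPO maximizes. It is a schematic under stated assumptions (scalar rewards and per-token log-probabilities are taken as given), not the InstructGPT implementation.

```python
import torch
import torch.nn.functional as F

# Stage 2 (illustrative): reward-model training on one human comparison.
# r_chosen / r_rejected are the scalar scores the RM assigns to the answer the
# labeler preferred and to the answer it was compared against.
r_chosen = torch.tensor(1.3, requires_grad=True)
r_rejected = torch.tensor(0.4, requires_grad=True)
rm_loss = -F.logsigmoid(r_chosen - r_rejected)  # pairwise ranking loss
rm_loss.backward()

# Stage 3 (illustrative): the reward PPO optimizes combines the RM score with a
# KL penalty that keeps the policy close to the SFT model's distribution.
beta = 0.02                                   # KL coefficient (assumed value)
logprob_policy = torch.tensor([-1.2, -0.8])   # log-probs of sampled tokens under the policy
logprob_sft = torch.tensor([-1.0, -0.9])      # log-probs of the same tokens under the SFT model
rm_score = torch.tensor(0.7)                  # reward-model score for the full answer

kl_penalty = logprob_policy - logprob_sft     # per-token KL estimate
reward = rm_score - beta * kl_penalty.sum()   # objective PPO pushes up
print(float(reward))
```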
Key findings of RLHF include 6:
- Significant Improvement in Instruction Following and Output Quality: InstructGPT's outputs were consistently preferred over original GPT-3 in human evaluations.
- Improved Truthfulness: InstructGPT significantly reduced the frequency of fabricating information when generating factual content. For example, on the TruthfulQA benchmark, its frequency of generating truthful and informative answers was about twice that of GPT-3.
- Some Reduction in Toxic Content Output: When prompted to be respectful, InstructGPT generated about 25% less toxic output than GPT-3.
- Minimizing "Alignment Tax": The RLHF process can sometimes lead to a decrease in model performance on some standard NLP benchmarks (like SQuAD, DROP), known as the "alignment tax." Research found that by mixing a portion of gradients aimed at maximizing pre-training data likelihood (called PPO-ptx) into the PPO updates, this performance regression could be significantly reduced without sacrificing human preference scores.
- Generalization to Unseen Labeler Preferences: InstructGPT models not only learned the preferences of specific labelers in the training data but also generalized well to the preferences of "out-of-distribution" labelers who did not participate in training data annotation. This suggests that the model learned more universal human preference patterns rather than just overfitting the training set.
RLHF, by introducing human preference rankings of model outputs as a learning signal, enables models to learn nuanced characteristics that align with human values. The reward model acts as a proxy for human preferences, guiding the policy model toward better solutions in the vast output space. RLHF is therefore a key technology for bridging the gap between LLM capabilities and human expectations, especially in open-ended generation and dialogue scenarios with high output-quality requirements.

To compare these two fine-tuning techniques, SFT (instruction tuning) and RLHF, the following table provides an overview:

Table 2: Comparison of SFT (Instruction Tuning) and RLHF Fine-tuning Methods
Aspect | SFT (Instruction Tuning) | RLHF |
---|---|---|
Objective | Enable the model to understand and execute instructions given in natural language form, improving zero-shot/few-shot generalization. | Make model outputs more aligned with human preferences, values, and expectations (e.g., more helpful, truthful, harmless). |
Input Data | Large number of "instruction-desired output" pairs (demonstration data). | Human preference ranking data for multiple model outputs (comparison data). |
Training Process | Supervised learning, directly optimizing the probability of the model generating the desired output for a given instruction. | Typically three stages: 1. SFT warm-up; 2. Train Reward Model (RM) to learn human preferences; 3. Use RM as a reward signal to optimize the language model policy via reinforcement learning (e.g., PPO). |
Main Output/Impact | Model gains initial ability to follow instructions, shows good zero-shot performance on unseen tasks. | Model output quality (subjective feel, safety, truthfulness, etc.) is significantly improved, better aligning with human intent. |
Advantages | Conceptually relatively simple, direct training process, effectively teaches the model "what to do." | Can learn complex, hard-to-define human preferences, addressing subjectivity issues that SFT struggles with. |
Limitations/Challenges | Relies on high-quality, diverse instruction data; may not handle output subjectivity and nuances well. | High data annotation cost (requires human comparison ranking); complex training process (involves training and coordinating multiple models); reward model may be exploited or biased (reward hacking). |
In summary, SFT and RLHF play different but complementary roles in the LLM training pipeline. SFT usually serves as the first step, teaching the model "what to do" through instruction data, endowing it with basic instruction-following capabilities and the potential to generalize to new tasks. RLHF then builds upon this foundation, further teaching the model "how to do it better" through human preference data, making its outputs closer to complex human expectations in terms of quality, style, and safety. In practice, these two techniques are often used in combination to achieve optimal model performance and alignment.
Chapter 3: LLM's Leap into Multimodality: Exploring Vision-Language Models
After mastering powerful text understanding and generation capabilities, Large Language Models (LLMs) are rapidly expanding into the multimodal domain, with Vision-Language Models (VLMs or VLLMs) being one of the fastest-growing and most eye-catching directions. VLLMs aim to give LLMs the ability to "see" the world, enabling them to understand image content and associate it with language information, thereby completing more complex cognitive tasks. This chapter will use MiniGPT-4 as an example to explore how LLMs extend into the visual domain.
- Core Paper: MINIGPT-4: ENHANCING VISION-LANGUAGE UNDERSTANDING WITH ADVANCED LARGE LANGUAGE MODELS (arXiv:2304.10592)
- 3.1 MiniGPT-4: Architectural Design and Visual Information Fusion
The motivation behind MiniGPT-4 was to investigate a core question: whether the remarkable multimodal capabilities demonstrated by recent models like GPT-4 primarily stem from their use of more advanced and powerful LLMs as their "brains".1 If this hypothesis holds true, then by efficiently "feeding" visual information to an advanced LLM, it might be possible to replicate similar advanced multimodal capabilities at a lower cost.
MiniGPT-4's architectural design embodies the idea of parameter-efficient modality alignment 1:
- Frozen Visual Encoder: MiniGPT-4 adopts the same pre-trained visual components as the BLIP-2 model, specifically a ViT-G/14 vision Transformer from EVA-CLIP and a Q-Former network. Both components remain frozen (their parameters are not updated) during MiniGPT-4's training and are responsible for extracting deep visual features from input images.
- Single Linear Projection Layer: This is the only trainable component in the MiniGPT-4 architecture. It acts as a bridge, projecting (or aligning) the visual features from the Q-Former (the paper mentions Q-Former outputs 32 visual query tokens 13) into the word embedding space of the chosen LLM. In this way, visual information is transformed into a "linguified" representation that the LLM can understand.
- Frozen Advanced Large Language Model: MiniGPT-4 uses Vicuna as its language processing core. Vicuna is an LLM built upon LLaMA and fine-tuned with instructions, reportedly achieving 90% of ChatGPT's quality.1 In MiniGPT-4's training, Vicuna's parameters also remain frozen. Vicuna's context length is limited to 2048 tokens (including input and output).13
The core idea of MiniGPT-4 can be summarized as: using a lightweight, trainable projection layer to connect the output of a powerful, pre-trained visual encoder with the input space of an advanced, pre-trained LLM. This design significantly reduces the training cost and technical barrier for building VLLMs because it avoids end-to-end joint fine-tuning of the massive visual encoder and LLM. This allows researchers to more quickly experiment with different component combinations and alignment strategies. This "lightweight adapter" or "projection layer" approach has also become a strategy adopted by many subsequent VLLMs (such as Video-LLaMA, discussed later in this report), for example, using a Q-Former plus a projection layer.
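A minimal sketch of this "frozen encoder + trainable projection + frozen LLM" wiring is given below. The dimensions (32 query tokens, a 768-dimensional Q-Former output, a 4096-dimensional LLM embedding space) are stated as assumptions for illustration rather than the exact MiniGPT-4 configuration.

```python
import torch
import torch.nn as nn

# Illustrative wiring of MiniGPT-4-style alignment: only the projection trains.
# Dimensions are assumptions chosen for the example.
num_query_tokens, qformer_dim, llm_dim = 32, 768, 4096

class VisualProjector(nn.Module):
    """Maps frozen Q-Former outputs into the LLM's word-embedding space."""
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)  # the only trainable weights

    def forward(self, qformer_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(qformer_tokens)             # (batch, 32, llm_dim)

# Pretend outputs of the frozen ViT + Q-Former for one image.
qformer_tokens = torch.randn(1, num_query_tokens, qformer_dim)

projector = VisualProjector()
visual_soft_prompt = projector(qformer_tokens)

# The soft prompt is concatenated in front of the embedded text instruction and
# fed to the frozen LLM; only `projector` receives gradient updates.
text_embeddings = torch.randn(1, 16, llm_dim)        # stand-in for the embedded instruction
llm_input = torch.cat([visual_soft_prompt, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 48, 4096])
```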
- 3.2 Training Strategy and Emergence of Advanced Multimodal Capabilities
To effectively align visual information with the LLM and elicit advanced multimodal capabilities, MiniGPT-4 employs a carefully designed two-stage training strategy 1:
- First Stage (Pre-training):
- Objective: To initially learn the alignment between visual information and the language model. In this stage, the model's goal is to enable the LLM to generate relevant text descriptions based on input visual features. The output of the linear projection layer is treated as a "soft prompt," guiding the LLM to generate text consistent with the image content.
- Training Data: A mixed dataset of image-text pairs was used, including images and their corresponding short descriptions from large-scale datasets like LAION, Conceptual Captions, and SBU. Approximately 5 million image-text pairs were involved in this stage.
- Training Details: Only the linear projection layer was trained; the visual encoder and LLM remained frozen. Training was conducted for about 20,000 steps with a batch size of 256.
- Interim Results and Issues: After completing the first stage of training, MiniGPT-4 could make reasonable sense of image content. However, its generated language outputs were often not natural or fluent, frequently exhibiting issues like word repetition, incomplete sentences, incoherent content, or content not closely related to the topic. This indicated that merely aligning with a large but relatively "noisy" or "simple" set of image-text pairs was insufficient to support high-quality visual dialogue capabilities.
- Second Stage (Fine-tuning):
- Objective: To significantly improve the naturalness and reliability of the generated language, enhance user experience, and address the language quality issues exposed in the first stage.
- Catalytic Role of High-Quality Instruction/Description Data: The researchers found that high-quality instruction or detailed description data is crucial for improving the output quality and usability of VLLMs. Due to the lack of readily available, high-quality instruction fine-tuning datasets for the vision-language domain, the MiniGPT-4 team meticulously constructed a small-scale but high-quality "image-detailed description" dataset.
- Dataset Construction Process: First, using the model trained in the first stage, preliminary detailed descriptions were generated for about 5,000 images randomly selected from the Conceptual Caption dataset. If the generated descriptions were too short (e.g., less than 80 tokens, a threshold based on empirical observation as descriptions shorter than this tended to be incomplete 13), additional prompts (e.g., "###Human: Continue ###Assistant:") were used to guide the model to continue generating. Subsequently, ChatGPT was used to polish and correct these automatically generated descriptions, such as removing repetitive content, meaningless characters, non-English sentences, etc. Finally, after manual verification and screening, about 3,500 high-quality image-detailed description pairs were obtained.
- Fine-tuning Process: This carefully constructed high-quality dataset was used to fine-tune the model obtained from the first stage. A predefined dialogue template (e.g., ###Human: <Img><ImageFeature></Img><Instruction>###Assistant:) was used during fine-tuning, where <Instruction> was a randomly sampled instruction such as "Describe this image in detail."
- Efficiency: This fine-tuning stage was very efficient, requiring only about 400 training steps with a batch size of 12, and could be completed in about 7 minutes on a single A100 GPU.
After these two training stages, especially the fine-tuning with high-quality data in the second stage, MiniGPT-4 demonstrated various impressive advanced multimodal capabilities, many of which were similar to GPT-4's demonstrations and were difficult for traditional VLLMs to achieve. These capabilities include: generating very detailed and complex image descriptions; generating functional website code from hand-drawn sketches; explaining humorous elements and underlying meanings in images or memes; generating detailed cooking recipes from food photos; creating stories or poems based on given images; and writing advertising copy for products in images.1 The emergence of these capabilities is largely attributed to the advanced nature of its LLM "brain" (Vicuna). Once visual information is effectively "translated" and fed into the LLM, the LLM's inherent advanced cognitive abilities such as reasoning, generation, knowledge association, and even a degree of creativity can be applied to visual content. This corroborates the view that continuous progress in LLMs will directly drive the upper limits of VLLM capabilities.
- 3.3 Key Findings in Image Understanding and Generation Tasks
The MiniGPT-4 research yielded several key findings regarding VLLMs:
- Importance of Aligning Visual Features with Advanced LLMs: The experimental results strongly demonstrated that by appropriately aligning visual features with an advanced LLM (even via a simple linear projection), it is indeed possible to unlock the LLM's existing powerful language reasoning and generation capabilities and apply them effectively to vision-related tasks.1 This provides an efficient path for building powerful VLLMs.
- Extreme Necessity of Second-Stage Fine-tuning: The research clearly indicated that training for visual-language alignment using only a large number of typically short and noisy image description pairs (as used in the first stage) is insufficient to produce high-quality, natural dialogue capabilities and language outputs. Introducing a small-scale but high-quality dataset containing rich, detailed descriptions for second-stage fine-tuning is crucial for significantly improving the model's generation reliability, language fluency, and overall usability.1 This aligns with the emphasis on data quality in SFT and instruction tuning in the pure text LLM domain: at certain stages, the "quality" of data is often more critical than "quantity," especially in shaping the model's ability to follow complex instructions and generate nuanced outputs that meet human expectations.
- High Training Efficiency: One of MiniGPT-4's core innovations is its parameter efficiency. Since only a very small linear projection layer (about 5 million parameters) needs to be trained, while the massive visual encoder (billions of parameters) and LLM (e.g., Vicuna-13B has 13 billion parameters) remain frozen, its training cost is relatively low. The first stage of pre-training takes about 10 hours (using 4 A100 GPUs), and the second stage of fine-tuning takes only a few minutes.1
- Capability Boundaries and Limitations 1:
- Hallucination: MiniGPT-4, like its underlying LLM, also suffers from hallucination. That is, the model sometimes generates objects that are not present in the image or describes details inconsistent with the image content. The longer the generated text, the higher the probability of hallucination seems to be.
- Insufficient Spatial Localization Understanding: The model may perform poorly in understanding and describing the precise spatial relationships of objects in an image. For example, it might struggle to accurately indicate the specific location of a window in a room. This might be related to the lack of aligned image-text pairs specifically targeting spatial understanding in its training data.
- Balancing Finer-Grained Recognition Tasks: The discussion in the paper mentioned that how to achieve a better balance between cognition-related tasks (such as advanced reasoning, story generation) and fine-grained recognition tasks (such as precise object recognition and localization) is a direction worth exploring in future research.13
MiniGPT-4's exploration provides an important reference paradigm for the VLLM field, demonstrating how to quickly build models with advanced multimodal capabilities under limited resources through clever architectural design and training strategies. It also reveals the decisive impact of data quality and the LLM's own capabilities on VLLM performance.
Chapter 4: LLM Innovation in Audio Interaction: Real-time and Synchronicity
With breakthroughs in LLMs in text and vision, extending their capabilities to audio interaction, especially achieving more natural and real-time voice dialogue, has become a new research frontier. Human voice dialogue is highly dynamic and synchronous, posing new challenges for traditional, often asynchronous and turn-based LLMs. This chapter will explore two representative works: SyncLLM, dedicated to achieving full-duplex synchronous dialogue; and Mini-Omni, pursuing "listen, talk, and think while streaming" capabilities.
- 4.1 Synchronous LLMs: Towards Full-Duplex Dialogue Interaction
- Core Paper: Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents (arXiv:2409.15594)
Problem Background: Traditional human-computer dialogue systems, including many LLM-based systems, mostly adopt a "half-duplex" interaction mode in which the user and the machine take turns speaking; one party finishes before the other responds. This is far from the "full-duplex" nature of natural human conversation, in which both parties can speak and listen simultaneously, with rapid turn-taking, speech overlaps, and backchannels like "uh-huh" or "yeah".8 These synchronous dynamics make conversations fluid and natural and convey rich interaction information. However, pre-trained LLMs have no inherent concept of "time," which makes it difficult for them to model this synchronicity directly and has become a key bottleneck for natural audio interaction with LLMs.

The core idea of SyncLLM is to give the LLM the ability to perceive and process temporal information, enabling it to participate in full-duplex spoken dialogue 8 (a toy sketch of the chunk-level interleaving appears after the list below):
- Time Information Integration: SyncLLM integrates temporal information into the LLM (Llama3-8b was used in this study) through a novel mechanism, allowing it to run synchronously with a real-world clock. This is achieved by periodically inserting special "synchronization tokens" into the model's input and output sequences. These synchronization tokens provide a common time frame for both parties in the dialogue.
- Full-Duplex Modeling: The model is trained to predict the speech units (e.g., HuBERT tokens) of both dialogue participants (the user and the LLM itself) within each time chunk. By simultaneously predicting both parties' speech, the model can learn and generate interaction sequences containing full-duplex dialogue phenomena such as overlaps, backchannels, and rapid turn-taking. During actual interaction with a user, the model replaces its own predictions of the user's speech with the user's actual voice input.
- Latency Tolerance: Considering potential delays from factors like network transmission, SyncLLM is designed to predict speech units for both speakers for a short future period (e.g., 160-240 milliseconds). This predictive capability allows the model to maintain dialogue fluency even with a certain degree of latency, similar to how humans anticipate responses in conversation.
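The toy sketch below illustrates the chunk-level interleaving idea: speech-unit streams for both speakers are cut into fixed-duration chunks and separated by periodic synchronization tokens, so that sequence position tracks wall-clock time. The token name, chunk size, and interleaving order are illustrative assumptions, not SyncLLM's exact token format.

```python
# Toy illustration (not SyncLLM's actual format): interleave two speakers'
# speech-unit streams into fixed-duration chunks separated by a sync token.
SYNC = "<sync>"          # assumed synchronization token
CHUNK = 4                # assumed number of speech units per time chunk

def interleave(speaker_a, speaker_b, chunk=CHUNK, sync=SYNC):
    """Merge two equal-length unit streams into one time-aligned sequence."""
    sequence = []
    for start in range(0, len(speaker_a), chunk):
        sequence.append(sync)
        sequence.extend(speaker_a[start:start + chunk])   # this chunk of speaker A
        sequence.extend(speaker_b[start:start + chunk])   # same time chunk of speaker B
    return sequence

# HuBERT-style unit ids; "0" might stand for silence while the other speaker talks.
user_units  = [12, 31, 31, 7, 0, 0, 0, 0]
agent_units = [0, 0, 0, 0, 55, 55, 19, 19]
print(interleave(user_units, agent_units))
# ['<sync>', 12, 31, 31, 7, 0, 0, 0, 0, '<sync>', 0, 0, 0, 0, 55, 55, 19, 19]
```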
Due to the extreme scarcity of high-quality, large-scale real full-duplex spoken dialogue data, training directly on such data is very difficult.8 To overcome this challenge, SyncLLM employs an innovative three-stage training method.8 It relies primarily on a large amount (about 212,000 hours) of synthetic spoken dialogue generated from pure-text dialogue data for initial training; this synthetic data is given speech attributes through text-to-speech (TTS) technology and can simulate some dialogue dynamics. A relatively small amount (about 2,000 hours) of real-world spoken dialogue data is then used for fine-tuning so that the model learns more realistic speech characteristics and interaction patterns. This strategy effectively leverages readily available text data to generate large-scale training material, significantly reducing reliance on scarce real data, and is an important way to accelerate the iteration and development of audio LLMs.

Key findings of SyncLLM include 14:
- In terms of dialogue content Meaningfulness, SyncLLM significantly outperformed the then state-of-the-art open-source full-duplex voice model dGSLM, while maintaining comparable or even better levels of turn-taking Naturalness.
- The model demonstrated good generalization to out-of-distribution data; for example, a model trained on the Fisher corpus also performed well on the Candor test set.
- The model can effectively handle network latencies up to 200 milliseconds and maintain dialogue coherence in simulated LLM-to-LLM interactions.
- Solving the time synchronization problem is core to achieving truly full-duplex, low-latency, human-like voice interaction, which is crucial for the practical application of audio LLMs.
- 4.2 Streaming Audio Dialogue Models: Achieving "Listen, Talk, and Think While Streaming"
- Core Paper: Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (arXiv:2408.16725)
Problem Background: Although some models have attempted voice interaction, current open-source academic models still fall well short of truly real-time, naturally fluent voice dialogue of the kind demonstrated by GPT-4o. These models often rely on external Text-to-Speech (TTS) systems for speech synthesis, which introduces non-negligible latency and greatly diminishes the interaction experience.10 Achieving "listen, talk, and think while streaming" interaction requires the model to simultaneously process the input audio stream, perform internal language and acoustic reasoning, and generate the output audio stream in real time, posing new challenges for model architecture and decoding strategies.

The core contribution of Mini-Omni lies in its commitment to creating an open-source model capable of real-time, end-to-end voice interaction 10:
- First Open-Source End-to-End Real-time Speech Interaction Model: Mini-Omni is proposed as the first open-source, end-to-end multimodal large language model with audio input and streaming audio output capabilities. This means it does not rely on external ASR (Automatic Speech Recognition) or TTS modules; all processing is done within a unified model.
- Text-instructed Parallel Generation: To avoid sacrificing the model's text reasoning ability while streaming audio output, Mini-Omni proposes a text-instructed speech generation method. In this method, the Transformer model is designed to simultaneously produce audio tokens and text tokens. The audio output is delivered in real-time through an internal text-to-speech synthesis mechanism, ensuring low first-response latency while leveraging the model's strong reasoning capabilities in the text domain.
- Batch Parallel Decoding: To further improve reasoning quality during streaming audio output (since complex reasoning directly in the audio modality is more challenging), Mini-Omni introduces a batch parallel decoding strategy. For a single user input, the model internally processes two tasks in parallel: one generates both text and audio responses, the other generates a text response only. The text content produced by the text-only task is embedded into the corresponding text-token positions of the first task, and the first task's audio stream is generated from that text content. This approach effectively "transfers" the model's stronger text reasoning ability to the audio output modality with minimal resource overhead (a highly simplified sketch of this pattern follows the list).
- "Any Model Can Talk" Method: This is a training methodology aimed at quickly endowing existing LLMs with voice interaction capabilities with minimal modifications to the original model and minimal training data requirements. It typically includes three stages: modality alignment (training adapters to enable the LLM to understand and generate speech), adaptation training (training the LLM's text capabilities to adapt to audio input while adapters are frozen), and multimodal fine-tuning (fine-tuning the entire model).
- VoiceAssistant-400K Dataset: Addressing potential shortcomings of existing general question-answering datasets for training voice assistants (e.g., tone, style may not match), the Mini-Omni team also synthesized a dedicated dataset called VoiceAssistant-400K for fine-tuning models to achieve a better voice assistant interaction style.
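A highly simplified sketch of the batch-parallel-decoding idea follows. It treats the model as a black box behind placeholder functions, so every name and value is an assumption rather than a real Mini-Omni API; the point is only the pattern of pairing a text-only decoding task with a text-plus-audio task and letting the text-only output drive the audio stream. In the real system the two tasks advance together as one batch; the sketch flattens that into sequential calls for readability.

```python
# Schematic only: batch parallel decoding in the spirit of Mini-Omni.
# `generate_text` and `text_to_audio_tokens` are placeholders standing in for
# the model's text decoding and its internal streaming text-to-speech.

def generate_text(prompt: str) -> str:
    """Placeholder for the text-only decoding task of the batch."""
    return f"(text answer to: {prompt})"

def text_to_audio_tokens(text: str) -> list[int]:
    """Placeholder for the text-conditioned audio-token stream."""
    return [hash(ch) % 1024 for ch in text]   # fake audio codec tokens

def batch_parallel_decode(user_prompt: str) -> tuple[str, list[int]]:
    # Task 1: text-only decoding, where the model's reasoning is strongest.
    text_answer = generate_text(user_prompt)
    # Task 2: audio decoding, conditioned on task 1's text tokens, so the spoken
    # answer inherits the text branch's reasoning quality.
    audio_tokens = text_to_audio_tokens(text_answer)
    return text_answer, audio_tokens

text, audio = batch_parallel_decode("What is the capital of France?")
print(text, audio[:5])
```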
Key findings of Mini-Omni include 11:
- The model successfully achieved real-time dialogue capabilities, with its audio output quality reportedly comparable to common TTS systems.
- Through its unique parallel generation and decoding strategies, the model can effectively preserve the original LLM's language understanding and reasoning capabilities while adding speech capabilities, avoiding significant performance degradation.
- Batch parallel decoding was proven to be an effective means of enhancing the model's reasoning ability in a new modality (audio output).
- The research also showed that even relatively small models (e.g., 0.5B parameters), through efficient method design and training, can handle complex real-time dialogue tasks, which is important for deploying models in resource-constrained environments.
The work of SyncLLM and Mini-Omni collectively reveals the development direction of LLMs in the audio interaction field: pursuing a more natural (full-duplex, synchronous) and more real-time (streaming processing, low latency) human-like dialogue experience. They also highlight common challenges in this field, such as temporal modeling, data scarcity, and the complexity of maintaining high-quality reasoning while ensuring real-time performance. These studies provide valuable insights and technical reserves for the future design and optimization of audio LLMs.
Chapter 5: Application and Challenges of LLMs in Complex Video Understanding
Video, as an extremely information-dense multimodal data form containing dynamic visual and auditory content, poses severe tests for machine understanding capabilities. Extending the capabilities of Large Language Models (LLMs) to the video domain—enabling them to understand video content, describe dynamic scenes, answer related questions, and engage in dialogue with humans about video content—is an important research direction in multimodal artificial intelligence. This chapter will use Video-LLaMA as an example to explore the applications and challenges of LLMs in complex video understanding.
- Core Paper: Video-LLaMA An Instruction-tuned Audio-Visual Language Model for Video Understanding (arXiv:2306.02858)
- 5.1 Video-LLaMA: Instruction-Tuned Model for Audio-Visual Fusion
Video-LLaMA aims to build a system that allows LLMs to simultaneously understand both the visual and auditory content within videos and to engage in meaningful dialogue with humans based on this understanding.12 The core difference between video and static images or isolated audio clips lies in its inherent temporal dimension and dynamic characteristics. An effective video understanding model must therefore address two core challenges 12:
- Capturing Temporal Changes in Visual Scenes: Events, actions, and scene transitions in videos unfold over time. The model needs to be able to perceive and understand these temporal dynamics.
- Integrating Audio-Visual Signals: Many videos (such as movies, lectures, vlogs, etc.) contain audio information (like speech, sound effects, music) that is closely related to the visual content. The model needs to be able to effectively fuse these two modalities of information to form a comprehensive understanding of the video content.
To achieve this goal, Video-LLaMA adopts an Instruction Tuning strategy. This means the model is trained to follow instructions given in natural language form. These instructions guide the model to focus on specific aspects of the video, perform particular analytical tasks (such as description, question answering), or generate responses in a specific format. The introduction of instruction tuning aims to improve the model's adaptability to tasks and the quality and relevance of its output content.
- 5.2 Multi-Branch Cross-Modal Framework and Information Processing
Video-LLaMA's architectural design embodies the idea of "combinatorial innovation" using pre-trained models. Rather than building a massive end-to-end model from scratch, it combines multiple pre-trained expert models that have already proven themselves in their respective fields, connecting them through lightweight adapter modules. Its core architecture includes two main branches, a vision-language branch and an audio-language branch, which process the visual and auditory information in the video, respectively, and convert it into representations the LLM can understand.12 (A toy sketch of the vision-branch flow follows the branch descriptions below.)
- Vision-Language Branch:
- Components: This branch includes a frozen pre-trained image encoder (specifically, the ViT-G/14 visual Transformer and Q-Former used by BLIP-2), a learnable position embedding layer, a Video Q-Former (sharing architecture with BLIP-2's Q-Former), and a linear projection layer.
- Processing Flow: First, video frames sampled from the video pass through the frozen image encoder to extract high-level visual features. Then, temporal order information is injected into these frame features via the position embedding layer. These temporally encoded frame representations are fed into the Video Q-Former, whose role is to aggregate visual information from different frames and generate fixed-length video embedding vectors. Finally, a linear layer projects these video embedding vectors into the same dimensional space as the LLM's text embeddings, forming "video query vectors." These video query vectors are concatenated with the user's text instruction embeddings, serving as a "video soft prompt" input to the frozen LLM to guide it in generating text responses based on the video content.
- Audio-Language Branch:
- Components: This branch includes a pre-trained audio encoder (the ImageBind model was chosen), a learnable position embedding layer, an Audio Q-Former, and a linear projection layer. A key feature of ImageBind is its ability to align embeddings from different modalities (including image, text, audio, etc.) into a shared semantic space.
- Processing Flow: First, several audio clips (e.g., 2 seconds each) are uniformly sampled from the video's audio track. These audio clips are converted into spectrograms and then mapped to dense audio feature vectors by the ImageBind audio encoder. Similar to the Video Q-Former, the Audio Q-Former also processes the temporal information of audio clips by adding learnable position embeddings and fuses features from different audio clips to generate fixed-length audio feature representations. Finally, a linear layer projects these audio features into the LLM's embedding space, forming "audio query vectors."
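To make the vision-branch flow concrete, the sketch below strings together stand-ins for the frozen frame encoder, the learnable frame-position embedding, a Video Q-Former, and the final linear projection. All dimensions and the cross-attention stand-in for the Q-Former are assumptions for illustration; the real branch uses BLIP-2's ViT-G/14 and Q-Former.

```python
import torch
import torch.nn as nn

# Illustrative shapes (assumptions, not Video-LLaMA's exact configuration).
num_frames, tokens_per_frame, vis_dim, llm_dim, num_video_queries = 8, 32, 768, 4096, 32

# 1) Frozen image encoder + Q-Former output per sampled frame (stand-in tensor).
frame_tokens = torch.randn(1, num_frames, tokens_per_frame, vis_dim)

# 2) Learnable frame-position embedding injects temporal order.
frame_pos = nn.Parameter(torch.zeros(1, num_frames, 1, vis_dim))
frame_tokens = frame_tokens + frame_pos

# 3) "Video Q-Former" stand-in: cross-attention from learnable video queries
#    to all frame tokens, aggregating information across time.
video_queries = nn.Parameter(torch.zeros(1, num_video_queries, vis_dim))
cross_attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
flat_frames = frame_tokens.reshape(1, num_frames * tokens_per_frame, vis_dim)
video_emb, _ = cross_attn(video_queries, flat_frames, flat_frames)

# 4) Linear projection into the LLM embedding space -> "video soft prompt".
to_llm = nn.Linear(vis_dim, llm_dim)
video_soft_prompt = to_llm(video_emb)
print(video_soft_prompt.shape)  # torch.Size([1, 32, 4096])
```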
Video-LLaMA employs a staged multi-branch cross-modal training strategy 12:
- First Stage (Pre-training):
- Vision-Language Branch Pre-training: Trained using large-scale video-text description datasets (e.g., Webvid-2M, containing short videos and their text descriptions) and image-text description datasets (e.g., CC595k). The training task is video-to-text generation, where the LLM generates corresponding text descriptions given the visual representation of the video. This stage aims to imbue video features with extensive visual knowledge.
- Audio-Language Branch Pre-training: Due to the relative scarcity of high-quality, large-scale audio-text paired data, Video-LLaMA adopts a clever workaround here. It does not use direct audio-text data to train the audio branch but instead uses the same visual-text data as the vision branch for training. This is feasible because its chosen ImageBind audio encoder can align embeddings from different modalities (including visual and audio) into the same shared semantic space. When the LLM learns to understand visual representations in this shared space through visual-text data, it indirectly gains the ability to understand audio, as audio representations also reside in the same space, even without direct audio-text training data. This strategy of knowledge transfer using a shared embedding space is significant for addressing the scarcity of multimodal data.
- Second Stage (Fine-tuning):
- Vision-Language Branch Fine-tuning: Fine-tuned on various high-quality instruction-following datasets, which may come from image understanding (e.g., detailed description data from MiniGPT-4, instruction data from LLaVA) and video understanding (e.g., video instruction data from Video-Chat) domains. The purpose of this stage is to enhance the model's ability to follow complex instructions and its detailed understanding of image and video content.
- Audio-Language Branch Fine-tuning: The paper states that although the audio branch was not explicitly trained with audio-text data during the pre-training phase, Video-LLaMA exhibits remarkable zero-shot audio understanding capability during inference, thanks to the shared embedding space provided by ImageBind. This means the model can be directly applied to tasks requiring understanding of audio content without additional specialized fine-tuning for audio.
- 5.3 Evaluation of Video Content Perception, Understanding, and Interaction Capabilities
Through experimental evaluation, Video-LLaMA demonstrated various capabilities in audio-visual content understanding and dialogue 12:
- Audio-visual Integration Perception: The model can simultaneously understand auditory and visual information in videos and accurately answer questions involving both modalities. For example, if a person is speaking in a video while a specific background sound is present, the model may need to combine both to answer a question.
- Temporal Dynamics Capture: The model can successfully identify actions and events that occur over time in videos. For instance, it can describe the sequence of actions a girl performs in a video or determine the moving direction of an object (like a boat).
- Static Image Understanding: In addition to videos, Video-LLaMA can also perceive and understand static images, including understanding abstract concepts in images (e.g., judging if a scene is "unusual"), providing detailed descriptions, and even associating image content with human emotions or interactions (like friendly interaction between a dog and a human).
- Common-knowledge Concept Recognition: The model shows the ability to recognize common knowledge concepts in visual signals, such as identifying famous landmarks, well-known public figures, or fictional characters, and can engage in common-sense question-answering around these concepts.
Although Video-LLaMA has made significant progress, researchers also pointed out its limitations 12:
- Limitations in Perception Capability: The model's perception and understanding capabilities are, to some extent, limited by the quality and scale of its training datasets.
- Challenges in Processing Long Videos: For videos with very long durations, the model may face difficulties in capturing and maintaining long-range dependencies and understanding the complete narrative of complex events. This requires models with stronger temporal modeling capabilities and long-range memory.
- Inheriting LLM's Hallucination Problem: Similar to other LLM-based models, Video-LLaMA may also experience hallucinations, i.e., generating information inconsistent with the video content or fabricated out of thin air.
Video-LLaMA, as a prototype for audio-visual AI assistants, demonstrates the immense potential of LLMs in understanding complex dynamic multimodal scenarios. Its modular design, effective utilization of pre-trained models, and clever application of shared embedding spaces provide valuable lessons for the development of more powerful video understanding models in the future. However, the complexity of video understanding also means that models must not only process multimodal inputs but also deeply understand temporal dynamics, contextual relationships, and long-range causal relationships, which remain key areas for future research breakthroughs.
Chapter 6: Comprehensive Insights, Current Challenges, and Future Prospects
Following the in-depth discussions in the preceding chapters on the foundations of open-source LLMs, key fine-tuning technologies, and the applications of LLMs in multimodal domains such as vision, audio, and video, this chapter aims to comprehensively review these technical paths. It will reveal their intrinsic connections and synergistic potential, summarize the common challenges currently faced in LLM training and multimodal applications, and provide an outlook on future research directions and technological breakthroughs.
- 6.1 Intrinsic Connections and Synergistic Potential Among Various Technical Paths
The research analyzed in this report is not developing in isolation; these lines of work are interconnected and mutually reinforcing, collectively forming the evolutionary landscape of LLM technology.
- Open-Source Foundational Models (e.g., Llama 2) are Important Cornerstones for Subsequent Innovation: The emergence of open-source LLMs like Llama 2 2 has provided academia and industry with accessible, customizable, high-quality foundational models. This has greatly lowered the R&D threshold, allowing fine-tuning techniques such as SFT and RLHF to be validated and improved, and subsequent multimodal models such as MiniGPT-4 1 and Video-LLaMA 12 to be built on top of this shared foundation. Many studies explicitly build upon or draw design inspiration from the Llama series, demonstrating the empowering effect of open source on the entire ecosystem.
- SFT and RLHF are a Golden Combination for Enhancing LLM Capabilities and Alignment: Supervised Fine-Tuning (SFT), especially instruction tuning, endows LLMs with basic instruction-following capabilities and zero-shot generalization on unseen tasks.5 Reinforcement Learning from Human Feedback (RLHF) builds on this by learning human preferences, further aligning model behavior with more complex and subjective human values such as helpfulness, truthfulness, and harmlessness.6 The two techniques are typically used together; the InstructGPT pipeline, for example, applies RLHF optimization after an SFT warm-up stage. They complement each other, jointly improving the model's practicality and reliability (a minimal sketch of the preference-learning step at the core of RLHF appears after this list).
- LLMs are the Core "Brain" of Multimodal Intelligence: Whether it is MiniGPT-4 processing static images 1, SyncLLM 14 and Mini-Omni 11 handling dynamic audio interactions, or Video-LLaMA 12 understanding complex audio-visual streams, each system relies on a powerful Large Language Model for final semantic understanding, logical reasoning, and content generation. Realizing multimodal capabilities essentially means "translating" raw information from different modalities (such as visual or audio features) into a "language" the LLM can understand, via dedicated encoder and adapter modules, and then leveraging the LLM's cognitive abilities for processing. This pattern reflects the mainstream approach in current multimodal expansion: building efficiently on top of existing powerful pre-trained models.
- Universality and Importance of Instruction Tuning: From the FLAN model in the pure text domain achieving zero-shot learning breakthroughs through instruction tuning 5, to MiniGPT-4 in the visual domain using high-quality description data for second-stage fine-tuning to improve output quality 1, and Video-LLaMA in the video domain being an instruction-tuned audio-visual language model 12, all highlight the importance of "instruction" as an efficient interaction and fine-tuning paradigm. Through instructions, humans can more flexibly guide the model's behavior, and the model can better adapt to diverse task requirements.
- Modality Alignment Faces Common Challenges and Spurs Similar Solutions: How to effectively align information from different modalities (such as visual features, audio features) with the LLM's representation space is a core technical problem faced by all multimodal LLMs. Both MiniGPT-4 and Video-LLaMA have adopted similar strategies, namely, using frozen, pre-trained modal encoders and achieving this alignment through one or more trainable lightweight adapter layers (such as linear projection layers, Q-Formers). This parameter-efficient alignment method has become a trend in current multimodal LLM architectural design.
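To illustrate this shared design pattern, the sketch below shows a simplified Q-Former-style adapter, a rough analogue of the trainable lightweight layers mentioned above rather than the exact module used in BLIP-2, MiniGPT-4, or Video-LLaMA: a small set of learned query vectors cross-attends over frozen encoder features and is projected into the LLM embedding space, while the encoder and the LLM themselves stay frozen. Dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

# Simplified Q-Former-style adapter sketch: learned queries cross-attend over
# frozen encoder features and are projected into the LLM embedding space.
# Only the adapter parameters are trained; the encoder and LLM stay frozen.

ENC_DIM, LLM_DIM = 1024, 4096   # assumed encoder / LLM hidden sizes
NUM_QUERIES = 32                # number of learned query tokens


class QFormerStyleAdapter(nn.Module):
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, NUM_QUERIES, ENC_DIM) * 0.02)
        self.cross_attn = nn.MultiheadAttention(ENC_DIM, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(ENC_DIM)
        self.to_llm = nn.Linear(ENC_DIM, LLM_DIM)   # final projection into LLM space

    def forward(self, frozen_features: torch.Tensor) -> torch.Tensor:
        # frozen_features: (batch, seq_len, ENC_DIM) from a frozen vision/audio tower
        q = self.queries.expand(frozen_features.size(0), -1, -1)
        attended, _ = self.cross_attn(q, frozen_features, frozen_features)
        return self.to_llm(self.norm(attended))      # (batch, NUM_QUERIES, LLM_DIM)


adapter = QFormerStyleAdapter()
visual_features = torch.randn(2, 257, ENC_DIM)       # e.g. patch features of 2 frames
llm_tokens = adapter(visual_features)
print(llm_tokens.shape)                               # torch.Size([2, 32, 4096])
```

In practice the resulting tokens are simply concatenated with the text token embeddings before being fed to the frozen LLM, which is what makes this alignment approach so parameter-efficient.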
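Returning to the SFT-plus-RLHF item above, the preference-learning step at the heart of RLHF can be made concrete with the standard pairwise reward-model objective used in InstructGPT-style pipelines. The snippet below is a minimal, generic sketch with hypothetical stand-in scores, not code from any of the surveyed papers.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the pairwise reward-model objective in InstructGPT-style
# RLHF: the reward model should score the human-preferred ("chosen") response
# above the dispreferred ("rejected") one. The reward values would normally
# come from a reward-model head applied to (prompt, response) pairs; here they
# are hypothetical stand-in tensors.

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


reward_chosen = torch.tensor([1.3, 0.2, 0.9])     # scores for preferred responses
reward_rejected = torch.tensor([0.4, 0.5, -0.1])  # scores for rejected responses
print(preference_loss(reward_chosen, reward_rejected))  # smaller when chosen > rejected
```

The reward model trained with this objective then supplies the scalar feedback that a policy-optimization step (for example PPO) maximizes, typically under a KL penalty that keeps the fine-tuned policy close to the SFT model.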
- 6.2 Common Challenges and Limitations in LLM Training and Multimodal Applications
Although LLMs and their multimodal applications have made remarkable progress and demonstrated powerful emergent capabilities, it is essential to clearly recognize that current technology still faces numerous common challenges and inherent limitations. These bottlenecks constrain further model development and widespread application.
- Hallucination Problem: This is a long-standing and difficult-to-eradicate problem in the LLM field. From early GPT-3 6 to recent multimodal models like MiniGPT-4 1 and Video-LLaMA 12, models may generate content that is untruthful, inconsistent with the input, or fabricated out of thin air. In multimodal scenarios, hallucination may manifest as describing objects or events that do not exist in the image or video.
- Data Scarcity and Quality Bottlenecks:
- High-quality instruction data is crucial for the effectiveness of SFT, as FLAN's success relied on a large and diverse set of NLP task instructions.5
- RLHF requires a large amount of expensive human preference annotation data.6
- This problem is particularly prominent in the multimodal domain. Well-paired, semantically rich multimodal datasets covering diverse scenarios and instructions are even scarcer. For example, MiniGPT-4 had to specifically construct a small-scale but high-quality detailed description dataset for second-stage fine-tuning to improve output quality.1 SyncLLM heavily relies on synthetic spoken dialogue data to compensate for the lack of real data.14 Video-LLaMA's audio branch cleverly uses visual-text data, leveraging ImageBind's shared embedding space to achieve audio understanding, thereby circumventing the lack of large-scale audio-text data.12
- Model Scale and Computational Resource Requirements: Although researchers are exploring parameter-efficient fine-tuning methods (such as LoRA 15, or training only projection layers), the pre-training of foundational LLMs, and inference deployment in many scenarios, still demand enormous computational resources and storage. Meanwhile, some studies (such as FLAN 5) indicate that the benefits of methods like instruction tuning are often positively correlated with model scale; smaller models may not fully benefit or may even suffer performance degradation (a minimal LoRA sketch appears after this list).
- Alignment Tax: To make model behavior more aligned with human expectations (e.g., through RLHF), a cost is sometimes incurred in terms of the model's general capabilities. That is, the alignment process may lead to a decrease in performance on certain standard NLP benchmark tasks.6 How to achieve the optimal balance between alignment effectiveness and general capabilities is an issue requiring careful consideration.
- Real-time Performance and Latency Challenges: For multimodal applications requiring real-time interaction, such as image-text dialogue, and especially for streaming audio and video scenarios, inference latency is a key performance bottleneck. SyncLLM addresses latency through time synchronization and prediction of future speech units 14, while Mini-Omni pursues extremely low first-response latency through strategies like parallel decoding 11 (a simplified sketch of parallel decoding also appears after this list). These efforts reflect the field's strong focus on real-time performance.
- Complex Reasoning and Insufficient Robustness:
- Although LLMs have shown preliminary reasoning abilities, they are still inadequate in handling complex reasoning tasks that require deep logic, common sense, or specific domain knowledge. For example, MiniGPT-4 has difficulties in understanding precise spatial localization.1
- Video-LLaMA also faces challenges in processing long-duration videos and understanding long-range dependencies and causal relationships in complex events.12
- Furthermore, models' robustness to input noise, interference, or adversarial attacks remains an important open issue. A related study (though not among the seven core papers analyzed in this report) examined the robustness of vision-language models to common image corruptions (such as blur and noise), finding that different corruption types degrade performance to different degrees.16
- Interpretability and Controllability Challenges: LLMs are often considered "black box" models, and their internal decision-making processes are difficult to fully understand and explain. This lack of interpretability persists in multimodal scenarios and may even be more complex. At the same time, how to precisely control the model's output content, style, level of detail, etc., is also key to improving user experience and application reliability.
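As a concrete illustration of the parameter-efficient fine-tuning mentioned under the model-scale item above, the sketch below captures the core idea of LoRA: keep the pretrained weight frozen and train only a low-rank update. The dimensions are illustrative and not tied to any specific model in this report.

```python
import torch
import torch.nn as nn

# Minimal LoRA sketch: the pretrained weight W is frozen; only the low-rank
# factors A (d_in x r) and B (r x d_out) are trained, so the effective weight
# is W + (alpha / r) * A @ B with far fewer trainable parameters.

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False             # frozen pretrained weight
        self.lora_a = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(r, d_out))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scaling


layer = LoRALinear(4096, 4096)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # 65,536 of ~16.8M (about 0.4%)
```

Only about 0.4% of this layer's parameters are trainable, which is why such adapters make fine-tuning large backbones feasible on modest hardware.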
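Similarly, for the real-time interaction item, the parallel-decoding idea pursued by Mini-Omni can be sketched as a shared backbone state feeding a text head and an audio-token head in the same decoding step, so audio output can begin streaming without waiting for the full text response. This is a deliberately simplified, hypothetical sketch rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Simplified sketch of parallel decoding: one shared hidden state per step
# feeds BOTH a text head and an audio-token head, so text and audio tokens
# are emitted together instead of sequentially.

HIDDEN, TEXT_VOCAB, AUDIO_VOCAB = 1024, 32000, 4096  # assumed sizes

class ParallelHeads(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_head = nn.Linear(HIDDEN, TEXT_VOCAB)
        self.audio_head = nn.Linear(HIDDEN, AUDIO_VOCAB)

    def forward(self, hidden_state: torch.Tensor):
        # hidden_state: (batch, HIDDEN) output of the LLM backbone for one step
        text_token = self.text_head(hidden_state).argmax(dim=-1)
        audio_token = self.audio_head(hidden_state).argmax(dim=-1)
        return text_token, audio_token  # emitted in the SAME decoding step


heads = ParallelHeads()
for step in range(3):                    # stand-in for the autoregressive loop
    hidden = torch.randn(1, HIDDEN)      # would come from the language-model backbone
    text_tok, audio_tok = heads(hidden)
    print(f"step {step}: text token {text_tok.item()}, audio token {audio_tok.item()}")
```

Because the first audio tokens appear at the very first step, the time-to-first-sound is decoupled from the length of the textual answer, which is the intuition behind the very low first-response latency reported for streaming dialogue models.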
The following table summarizes the core capabilities and limitations of several major multimodal LLMs discussed in this report:
Table 3: Summary of Multimodal LLM Capabilities and Limitations
Model | Main Modality | Core Capabilities/Innovations | Limitations/Challenges Mentioned in Paper |
---|---|---|---|
MiniGPT-4 | Vision | Efficiently aligns vision with advanced LLM via a single projection layer; demonstrates various advanced visual QA & generation capabilities (e.g., website generation, recipe generation, poetry creation); two-stage training improves output quality. | Inherits LLM hallucination; weak in precise object spatial localization; requires high-quality detailed description data for fine-tuning. |
SyncLLM | Audio (Synchronous Dialogue) | Achieves full-duplex spoken dialogue; integrates time information into LLM, synchronizing with real clock; tolerates latency by predicting future speech units; uses synthetic data to overcome real data scarcity. | Relies on large amounts of synthetic data; complexity of full-duplex interaction places high demands on model capability. |
Mini-Omni | Audio (Streaming Dialogue) | First open-source end-to-end real-time speech interaction model; parallel text-audio decoding and streaming output deliver extremely low first-response latency. | |
Video-LLaMA | Video (Audio-Visual) | Instruction-tuned audio-visual language model; combines frozen pre-trained encoders (including ImageBind) with lightweight trainable adapters; captures temporal dynamics; achieves zero-shot audio understanding via ImageBind's shared embedding space. | Perception limited by training-data quality and scale; difficulty with long videos and long-range dependencies; inherits LLM hallucination. |
Works Cited
- scispace.com, accessed June 3, 2025, https://scispace.com/pdf/minigpt-4-enhancing-vision-language-understanding-with-317bm95a.pdf
- [2307.09288] Llama 2: Open Foundation and Fine-Tuned Chat Models, accessed June 3, 2025, https://ar5iv.labs.arxiv.org/html/2307.09288
- AI-Powered Paper Summarization about the arXiv paper 2307.09288v2, accessed June 3, 2025, https://www.summarizepaper.com/en/arxiv-id/2307.09288v2/
- [2109.01652] Finetuned Language Models Are Zero-Shot Learners - arXiv, accessed June 3, 2025, https://arxiv.org/abs/2109.01652
- arxiv.org, accessed June 3, 2025, https://arxiv.org/pdf/2109.01652
- [2203.02155] Training language models to follow instructions with human feedback, accessed June 3, 2025, https://ar5iv.labs.arxiv.org/html/2203.02155
- [2304.10592] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models - arXiv, accessed June 3, 2025, https://arxiv.org/abs/2304.10592
- arXiv:2409.15594v1 [cs.CL] 23 Sep 2024, accessed June 3, 2025, https://arxiv.org/pdf/2409.15594
- Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents - arXiv, accessed June 3, 2025, https://arxiv.org/abs/2409.15594
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming - arXiv, accessed June 3, 2025, https://arxiv.org/abs/2408.16725
- arxiv.org, accessed June 3, 2025, https://arxiv.org/pdf/2408.16725
- [2306.02858] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding, accessed June 3, 2025, https://ar5iv.labs.arxiv.org/html/2306.02858
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models - OpenReview, accessed June 3, 2025, https://openreview.net/forum?id=1tZbq88f27
- arxiv.org, accessed June 3, 2025, https://arxiv.org/pdf/2409.15594
- Make Some Noise: Towards LLM audio reasoning and generation using sound tokens - arXiv, accessed June 3, 2025, https://arxiv.org/html/2503.22275
- Analysing the Robustness of Vision-Language-Models to Common Corruptions - arXiv, accessed June 3, 2025, http://www.arxiv.org/abs/2504.13690