The Decoder February 9, 2025 NEUTRAL

How MILS lets LLMs work with images, video and audio

Meta AI researchers and academic partners developed MILS, a system that helps large language models handle images, video, and audio without specialized training. It uses a generator and scorer loop to improve answers step by step, with strong results in image and video description.

WTF Index NEUTRAL

MILS is a technical multimodal capability advance, but the story presents benign captioning and evaluation uses without clear danger or societal degradation.

Meta AI researchers and their academic partners have introduced MILS (Multimodal Iterative LLM Solver), a system designed to help large language models work with images, video, and audio without specialized training. Instead of adapting a model through extensive additional training, MILS uses an iterative process that pushes existing models to improve their answers through feedback.

The result is a different route into multimodality. Rather than building every capability directly into one model, MILS combines models in a loop so that proposed answers can be evaluated, refined, and improved.

How MILS works

MILS pairs two AI models with different roles. One is a "generator" that suggests an answer or solution for a task. The other is a "scorer" that checks how well that suggestion works.

The scorer's feedback then guides the generator toward a better response. This process repeats step by step until the system reaches a satisfactory result.

That structure matters because MILS relies on the problem-solving ability already present in large language models. The system does not need to modify model parameters during operation. It instead uses inference, feedback, and selection to move closer to a useful result.
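The propose, score, and refine cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the function names (refine, toy_generate, toy_score), the word list, and the scoring rule are all hypothetical stand-ins. In a real setup the generator would be a language model and the scorer a pre-trained multimodal model.

```python
from typing import Callable, List, Tuple

Scored = Tuple[float, str]  # (score, candidate text)

def refine(generate: Callable[[List[Scored]], List[str]],
           score: Callable[[str], float],
           steps: int = 4,
           keep: int = 2) -> Scored:
    """Gradient-free loop: propose candidates, score them, feed the best back."""
    feedback: List[Scored] = []
    for _ in range(steps):
        candidates = generate(feedback)
        scored = [(score(c), c) for c in candidates]
        # Previous survivors stay in the pool, so the best score never drops.
        pool = sorted(set(scored + feedback), reverse=True)
        feedback = pool[:keep]
    return feedback[0]

# Toy stand-ins (assumptions for illustration only).
HIDDEN = {"dog", "red", "ball", "grass"}  # plays the role of the image content
VOCAB = ["dog", "cat", "red", "ball", "car", "grass"]

def toy_score(text: str) -> float:
    """Reward words that match the hidden content, penalize the rest."""
    words = set(text.split())
    return len(words & HIDDEN) - 0.5 * len(words - HIDDEN)

def toy_generate(feedback: List[Scored]) -> List[str]:
    """Start from single words, then extend each surviving candidate by one word."""
    if not feedback:
        return list(VOCAB)
    out: List[str] = []
    for _, text in feedback:
        have = set(text.split())
        out.extend(f"{text} {w}" for w in VOCAB if w not in have)
    return out

best_score, best_text = refine(toy_generate, toy_score, steps=4, keep=2)
print(best_score, best_text)
```

No model weights are updated anywhere in this loop. All of the improvement comes from selecting among candidates and passing the survivors back to the generator.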

In practical terms, MILS turns multimodal work into a process of proposing, judging, and revising. The source material shows this approach being used across several kinds of data, including images, video, and audio.

Why image description is a strong use case

The system shows particular strength in describing images. In one setup, MILS used Llama-3.1-8B as the generator and CLIP as the scorer. With that pairing, it produced detailed image descriptions that matched or exceeded current leading methods.

That result is notable because CLIP was not specifically trained for this exact task. MILS uses the scorer model's existing ability to evaluate alignment between visual content and text, then channels that feedback into better descriptions.

This makes image captioning a clear example of the system's broader logic. The language model does not have to learn a new visual skill from scratch during operation. It can generate candidate descriptions, receive a signal from another model, and keep improving the wording until it better fits the image.
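To make the scorer's role concrete, here is a small sketch of ranking candidate captions by cosine similarity against an image vector. It is a toy under stated assumptions: the four-axis "embedding", the hand-written image vector, and the helper names (embed_text, rank_captions) are invented for illustration. CLIP itself uses learned image and text encoders, which this example does not reproduce.

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical shared embedding axes (a real scorer learns these from data).
AXES = ["dog", "ball", "grass", "car"]

def embed_text(caption: str) -> List[float]:
    """Toy text embedding: count how often each axis word appears."""
    words = caption.lower().split()
    return [float(words.count(axis)) for axis in AXES]

# Made-up "image embedding" for a photo of a dog with a ball on grass.
image_vec = [1.0, 1.0, 1.0, 0.0]

def rank_captions(captions: List[str]) -> List[str]:
    """Order captions from best to worst match with the image vector."""
    return sorted(captions,
                  key=lambda c: cosine(image_vec, embed_text(c)),
                  reverse=True)

ranked = rank_captions(["a car on grass",
                        "a dog with a ball on grass",
                        "a dog"])
print(ranked)
```

The ranking is the feedback signal. A generator that sees which captions scored highest can propose variations of those in the next round.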

MILS also supports text-to-image generation by refining text prompts. Better prompts can guide generation more effectively, and the same iterative logic can help search for a stronger prompt before the final image is made.

The system can also handle image editing tasks, including style transfer, by combining AI-generated prompts with image processing tools. In that workflow, MILS is not just describing media. It is helping coordinate the language instructions that drive a visual edit.

Video, audio, and multimodal conversion

MILS extends beyond still images. The source article says the system can also handle video and audio, and that it performed better than existing models at describing video content in tests using the MSR-VTT video dataset.

This points to a larger role for readable text inside multimodal AI systems. Because MILS works by proposing and scoring text rather than by modifying model parameters, it can express different kinds of data as text that a language model can work with.

That creates a bridge between formats. Information from images and audio, for example, can be converted into text, combined, and then converted back into the desired format. The important idea is not that every data type becomes the same. It is that text can act as a common workspace for merging and reasoning across different sources.

The source also notes that the quality of results improves when the system has more potential solutions to work with. Tests indicate that larger generator and scorer models produce more accurate outputs, and that scaling up to larger language models leads to noticeable quality improvements.
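The point about candidate count has a simple intuition, sketched below. This covers only the "more potential solutions" part of the finding, not the effect of model size. The random numbers are placeholders for scorer outputs, and best_of_n is a hypothetical helper name.

```python
import random
from typing import Callable, List

def best_of_n(score: Callable[[float], float], candidates: List[float]) -> float:
    """Return the highest score found among the candidates."""
    return max(score(c) for c in candidates)

# Each number stands in for one candidate's scorer output (illustrative only).
rng = random.Random(0)
pool = [rng.random() for _ in range(32)]

def identity(s: float) -> float:
    return s

small = best_of_n(identity, pool[:4])  # only the first 4 candidates
large = best_of_n(identity, pool)      # all 32, including those same 4
print(small, large)
```

Because the larger pool contains every candidate from the smaller one, its best score can only stay the same or go up. The trade-off is that each extra candidate costs another generator and scorer call.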

What MILS says about the direction of AI assistants

The AI field is moving quickly toward models that can handle multiple kinds of input. This shift is often called multimodality, and it is central to AI assistants that are expected to work in everyday situations.

The source article names several systems in that broader movement. OpenAI's GPT-4o led the way, while open-source alternatives are catching up. Meta's Llama 3.2, Mistral's Pixtral, and DeepSeek's Janus Pro can process images alongside text.

MILS approaches the same problem from another angle. Instead of putting all the training burden on the language model, it moves much of the requirement to a pre-trained scorer model. The language model's output can then be improved at inference time through a smarter process of generation and evaluation.

That makes MILS part of a broader direction in AI research: improving model behavior through better inference methods, not only by adding more training data. The research team also sees potential for MILS to tackle 3D data processing in the future.

For AI assistants, the implication is straightforward. If systems can translate images, video, and audio into workable text representations, combine that information, and use feedback to improve outputs, they can become more flexible without requiring specialized training for every task. MILS is one example of how that shift may happen.