Snap Inc. is pushing mobile AI image generation closer to what users expect from server-based systems. Its new SnapGen++ model is designed to create high-resolution images directly on smartphones, using a compact diffusion transformer rather than the older U-Net approach used by earlier on-device models.
According to the research paper, SnapGen++ can produce 1024 x 1024 pixel images in just 1.8 seconds on an iPhone 16 Pro Max. That speed matters because the model is not simply making small previews or low-resolution outputs. It is aimed at high-resolution text-to-image generation on mobile hardware.
A server-style architecture moves onto smartphones
The central shift in SnapGen++ is architectural. Earlier on-device models such as SnapGen relied on U-Net designs. SnapGen++ brings a diffusion transformer architecture to smartphones, the same broad class of architecture used by large server models such as Flux and Stable Diffusion 3.
Diffusion transformers combine two ideas that have become important in image generation. They use the diffusion approach for producing images while drawing on transformer strengths, including stronger handling of complex text prompts and efficient scaling. According to the paper, the result is more coherent and detailed images than U-Net-based predecessors could deliver.
Until now, that kind of architecture has been difficult to run on mobile devices because of its computational demands. Large server models can rely on far more powerful infrastructure. A smartphone has much tighter limits, so the model has to be smaller, faster, and more efficient without giving up too much quality.
The attention method that cuts the workload
The main technical barrier is attention. In diffusion transformers, the amount of computation can rise sharply as image resolution increases. That becomes a serious problem when the target is a 1024 x 1024 pixel image generated directly on a phone.
Snap’s team addressed this with a new attention method. Instead of having every region of the image attend to every other region in one full pass, the model combines a coarse global overview with fine local detail. This reduces the amount of computation while keeping the generated image aligned with the prompt and visually coherent.
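To see why splitting attention into a broad overview plus local detail helps, here is a rough count of query-key pairs. It is only a sketch: the token grid size, the pooling factor, and the window size are illustrative assumptions, and the text above does not specify how SnapGen++ actually implements its attention.

```python
def attention_pairs(grid, pool=4, window=8):
    """Count query-key pairs for a grid x grid map of image tokens.

    All parameters are hypothetical and chosen only for illustration:
      grid   - tokens per side of the image token map
      pool   - downsampling factor for the coarse "overview" keys
      window - side length of the local neighbourhood each token sees
    """
    n = grid * grid                      # total number of image tokens
    full = n * n                         # every token attends to every token
    coarse_keys = (grid // pool) ** 2    # pooled overview of the whole image
    local_keys = window * window         # fine detail near the query token
    split = n * (coarse_keys + local_keys)
    return full, split


for grid in (32, 64):
    full, split = attention_pairs(grid)
    print(f"grid {grid}x{grid}: full={full:,} split={split:,} ratio={full / split:.1f}")
```

With these made-up settings, a 64 x 64 token grid needs about 16.8 million pairs under full attention and about 1.3 million under the split scheme. Doubling the grid side from 32 to 64 multiplies the full-attention count by 16, which is the sharp growth described above.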
The reported gain is large. Latency per inference step falls from 2,000 milliseconds to under 300 milliseconds. That reduction is one reason the small version can reach a total latency of around 1.8 seconds on an iPhone 16 Pro Max when using four inference steps.
One training run, three model sizes
SnapGen++ also uses a training approach called Elastic Training. Instead of training separate systems for different hardware targets, a single run creates three variants of the model. The goal is to make the same underlying system adaptable across budget devices, high-end smartphones, and servers.
The three versions are:
- Tiny: 0.3 billion parameters, aimed at budget Android devices.
- Small: 0.4 billion parameters, intended for high-end smartphones.
- Full: 1.6 billion parameters, built for servers or quantized on-device use.
All three variants share parameters and train together. The authors say this allows the system to adapt to different hardware without requiring separate model training for each target. For mobile AI, that matters because device capabilities vary widely, even when the use case looks the same to the user.
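One way to picture variants that share parameters and train together is a single weight matrix from which smaller variants read only a sub-block. The sketch below is a toy under loud assumptions: the widths, the slicing scheme, the `forward` and `joint_step` helpers, and the toy loss are all hypothetical, and the text above does not say how Elastic Training shares weights in practice.

```python
import random

random.seed(0)

FULL_WIDTH = 8
# Hypothetical layer widths, not the real SnapGen++ configurations.
WIDTHS = {"tiny": 2, "small": 4, "full": 8}

# One shared weight matrix; each variant reads only its top-left d x d corner.
W = [[random.gauss(0.0, 1.0) for _ in range(FULL_WIDTH)] for _ in range(FULL_WIDTH)]


def forward(x, variant):
    """Linear layer restricted to the first d inputs and d outputs."""
    d = WIDTHS[variant]
    return [sum(x[i] * W[i][j] for i in range(d)) for j in range(d)]


def joint_loss(x):
    """Toy objective: 0.5 * squared output norm, summed over all variants."""
    return sum(0.5 * sum(y * y for y in forward(x, v)) for v in WIDTHS)


def joint_step(x, lr=0.01):
    """One gradient-descent step on joint_loss, updating the shared W in place."""
    grad = [[0.0] * FULL_WIDTH for _ in range(FULL_WIDTH)]
    for variant, d in WIDTHS.items():
        y = forward(x, variant)
        for i in range(d):
            for j in range(d):
                grad[i][j] += x[i] * y[j]  # d(0.5 * y_j^2) / dW[i][j]
    for i in range(FULL_WIDTH):
        for j in range(FULL_WIDTH):
            W[i][j] -= lr * grad[i][j]


x = [1.0] * FULL_WIDTH
before = joint_loss(x)
joint_step(x)
after = joint_loss(x)
print(f"joint loss before={before:.3f} after={after:.3f}")
```

In this toy, the weights in the tiny corner receive gradient from all three variants in the same step, so no variant needs its own training run.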
The team also developed a specialized distillation method to make inference more efficient. It reduces the required inference steps from 28 to just four while preserving nearly the same quality. Fewer steps directly support the short generation time reported for the iPhone 16 Pro Max.
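The effect of fewer steps and faster steps can be checked with simple arithmetic on the reported figures. The helper name below is made up for illustration, and 300 milliseconds is used as an upper bound because the article gives the per-step latency only as "under 300 milliseconds".

```python
def sampling_latency_ms(steps, ms_per_step):
    """Time spent in the denoising loop alone, in milliseconds."""
    return steps * ms_per_step


# Per-step figures from the article: 2,000 ms before, under 300 ms after.
four_steps_slow = sampling_latency_ms(4, 2000)   # four steps at the old per-step cost
four_steps_fast = sampling_latency_ms(4, 300)    # upper bound at the new per-step cost
undistilled_fast = sampling_latency_ms(28, 300)  # 28 steps at the new per-step cost

print(four_steps_slow, four_steps_fast, undistilled_fast)
```

Four steps at the old per-step cost would take 8 seconds, and 28 steps at the faster cost would still take up to 8.4 seconds. Only the combination brings the denoising loop under 1.2 seconds, which fits within the reported total of about 1.8 seconds. The article does not break down where the rest of that total goes.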
Small model, large-model comparisons
The benchmark claims are striking because SnapGen++ is compact. The small version has 0.4 billion parameters, yet it consistently outperforms Flux.1-dev in both image quality and text-image alignment tests, according to the research paper. Flux.1-dev has 12 billion parameters, making it 30 times larger.
The paper also reports that SD3.5-Large, with 8.1 billion parameters, falls short of Snap’s largest model. These comparisons matter because they frame SnapGen++ not only as a faster phone model, but as a compact model that can compete with much larger systems in the tested areas.
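The size gaps behind these comparisons are easy to verify from the parameter counts quoted in the article. The dictionary and helper below are illustrative names, with counts stored in millions so the division stays exact.

```python
# Parameter counts in millions, as quoted in the article.
PARAMS_MILLIONS = {
    "SnapGen++ Tiny": 300,
    "SnapGen++ Small": 400,
    "SnapGen++ Full": 1600,
    "Flux.1-dev": 12000,
    "SD3.5-Large": 8100,
}


def size_ratio(larger, smaller):
    """How many times more parameters `larger` has than `smaller`."""
    return PARAMS_MILLIONS[larger] / PARAMS_MILLIONS[smaller]


print(size_ratio("Flux.1-dev", "SnapGen++ Small"))   # 30.0
print(size_ratio("SD3.5-Large", "SnapGen++ Full"))   # 5.0625
```

So Flux.1-dev is 30 times the size of the small variant, and SD3.5-Large is roughly five times the size of the full variant.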
That does not mean every mobile device will deliver the same result. The reported 1.8-second figure is tied to the small version running on an iPhone 16 Pro Max. Snap’s own model lineup suggests that hardware differences still matter, which is why the Tiny, Small, and Full variants exist.
Why this matters for Snapchat’s AI direction
Snap has already been investing in AI features inside its messaging app. Examples include its in-house chatbot "My AI" and a $400 million partnership with Perplexity AI announced in November 2025. The AI search engine is expected to be integrated into Snapchat by default this year.
SnapGen++ fits into that broader push by moving more image generation capability onto the device itself. SnapGen previously enabled 1024-pixel images on smartphones, but it reportedly could not match the quality of large server models. SnapGen++ is presented as the next step: a diffusion transformer designed for high-resolution on-device generation.
Other companies, including Google, are also working on efficient diffusion models for mobile devices. But according to the research paper, SnapGen++ is the first efficient diffusion transformer for high-resolution on-device generation.
The broader signal is clear: mobile AI image generation is becoming less about whether a phone can generate an image at all, and more about how close that output can get to server-quality systems while staying fast. SnapGen++ makes that race more concrete by pairing a 0.4 billion parameter mobile model with 1024 x 1024 pixel output and a reported 1.8-second generation time on an iPhone 16 Pro Max.