Nvidia is preparing Rubin CPX, a new GPU class designed around one of the hardest parts of modern AI: processing enormous amounts of context before a model begins producing an answer. The company is positioning the chip as a dedicated answer to workloads that involve millions of tokens of context, such as full codebase analysis and video generation.
Why the context phase matters
When a language model handles a complex request, the work can be understood in two broad phases. First, the model takes in and analyzes the material it has been given. Then it generates the response step by step.
A simple example makes the split concrete: if a model is asked to summarize a long book, it must first read and evaluate the entire text. That first step is the compute-heavy context phase, also called the analysis phase. Only after that does the generation phase begin, where the model creates the summary word by word.
Nvidia’s argument is that those two phases do not stress hardware in the same way. The context phase is compute-bound, while the generation phase that follows is memory-bandwidth-bound. Rubin CPX is being built for the first of those jobs.
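A rough way to see why the two phases behave differently is to compare how much arithmetic each one does per byte of model weights it reads. The sketch below is a simplified roofline-style estimate, not Nvidia's method: the `Accelerator` figures are placeholder values rather than the specifications of any real chip, and the cost model (about 2 FLOPs per parameter per token, weights read once per pass, attention and cache traffic ignored) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator; the numbers used below are placeholders."""
    peak_flops: float      # peak compute, in FLOP/s
    mem_bandwidth: float   # memory bandwidth, in bytes/s

    @property
    def ridge_point(self) -> float:
        """FLOPs per byte above which work is limited by compute."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumed cost model: 2 * params * tokens FLOPs, params * bytes_per_param
    bytes read. The parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def classify(tokens_per_pass: int, bytes_per_param: float, hw: Accelerator) -> str:
    """Label a pass as compute-bound or memory-bandwidth-bound on `hw`."""
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    if intensity > hw.ridge_point:
        return "compute-bound"
    return "memory-bandwidth-bound"


# Placeholder hardware: 1e15 FLOP/s and 2e12 bytes/s give a ridge point of 500.
hw = Accelerator(peak_flops=1e15, mem_bandwidth=2e12)

# Context phase: the whole prompt (here 100,000 tokens) goes through in one pass.
context_phase = classify(tokens_per_pass=100_000, bytes_per_param=1.0, hw=hw)

# Generation phase: one new token per pass for a single sequence.
generation_phase = classify(tokens_per_pass=1, bytes_per_param=1.0, hw=hw)

print(context_phase)     # compute-bound
print(generation_phase)  # memory-bandwidth-bound
```

Under these assumptions the context pass performs 200,000 FLOPs per byte of weights, far above the ridge point, while a single generation step performs only 2. Real systems batch many sequences during generation, which raises that figure, so the estimate only illustrates the direction of the gap.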
That distinction is important because AI applications are increasingly asking models to work with far more information at once. A system that can examine an entire codebase, or process large inputs for video generation, needs hardware that can carry the heavy analysis load efficiently before output begins.
What Rubin CPX is designed to do
According to an official announcement, Rubin CPX is a derivative of the Rubin product line planned for 2026. It is set to launch at the end of 2026 and will be offered either as a plug-in card or as a standalone computer for data centers.
Nvidia says Rubin CPX is aimed at AI applications with massive context windows. CEO Jensen Huang calls CPX the first CUDA GPU built specifically for "massive-context AI." In practical terms, the chip is meant for situations where the model must process a large body of input before it can produce a useful result.
The technical direction is tied to Nvidia’s disaggregated inference strategy. Instead of treating inference as one uniform workload, the approach separates the context phase from the generation phase. Different hardware can then be used where it fits best.
Rubin CPX’s stated specifications reflect that role. Nvidia says the chip will use a monolithic die design, deliver 30 petaFLOPs of NVFP4 compute, include 128 GB of GDDR7 memory, and provide triple the attention layer acceleration compared to Blackwell.
Those details point to a GPU built around the part of AI inference where attention and large-context processing matter most. The goal is not simply to make a general-purpose accelerator faster, but to build a device for the part of the workload Nvidia sees as structurally different.
How split inference works
Split inference is the broader idea behind Rubin CPX. It is a strategy in which the analysis phase and the generation phase are handled separately because their hardware needs differ.
That separation can already be done in software. Nvidia has used a related approach with Blackwell Ultra, where the context and generation stages were divided across different GPUs. Rubin CPX is the next step: a dedicated hardware product for the compute-heavy context side.
The logic is straightforward. If the first phase is constrained by compute and the second phase is constrained by memory bandwidth, one kind of GPU may not be the ideal fit for both. Assigning each phase to hardware better suited to its demands could improve throughput for workloads with very large inputs.
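The routing idea can be sketched in a few lines. The following is a toy illustration only, assuming two hypothetical worker pools and a simple round-robin assignment; the class names (`ContextWorker`, `GenerationWorker`, `ContextState`) and the `serve` function are invented for this example and do not correspond to Nvidia's software. The placeholder tokens stand in for real model output.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Request:
    request_id: int
    prompt_tokens: list[str]
    max_new_tokens: int


@dataclass
class ContextState:
    """Stand-in for the state handed from the context phase to generation."""
    request_id: int
    cached_tokens: list[str]
    produced_by: str


class ContextWorker:
    """Hypothetical compute-oriented worker: processes the full prompt at once."""

    def __init__(self, name: str) -> None:
        self.name = name

    def run(self, req: Request) -> ContextState:
        # A real system would run the model over every prompt token here.
        return ContextState(req.request_id, list(req.prompt_tokens), self.name)


class GenerationWorker:
    """Hypothetical bandwidth-oriented worker: emits one token per step."""

    def __init__(self, name: str) -> None:
        self.name = name

    def run(self, state: ContextState, max_new_tokens: int) -> list[str]:
        output = []
        for step in range(max_new_tokens):
            token = f"tok{step}"  # placeholder for a sampled token
            state.cached_tokens.append(token)  # each step extends the state
            output.append(token)
        return output


def serve(requests, context_pool, generation_pool):
    """Send each request to a context worker, then hand off to a generation worker."""
    context_workers = cycle(context_pool)
    generation_workers = cycle(generation_pool)
    results = {}
    for req in requests:
        state = next(context_workers).run(req)       # phase 1: context
        gen = next(generation_workers)               # phase 2: generation
        tokens = gen.run(state, req.max_new_tokens)
        results[req.request_id] = (state.produced_by, gen.name, tokens)
    return results


results = serve(
    [Request(1, ["a", "b", "c"], 2), Request(2, ["d"], 3)],
    context_pool=[ContextWorker("ctx-0")],
    generation_pool=[GenerationWorker("gen-0"), GenerationWorker("gen-1")],
)
print(results[1])  # ('ctx-0', 'gen-0', ['tok0', 'tok1'])
print(results[2])  # ('ctx-0', 'gen-1', ['tok0', 'tok1', 'tok2'])
```

The point of the sketch is the handoff: each pool can be sized and equipped independently, at the cost of moving the intermediate state between workers. A production scheduler would also need to account for that transfer cost and for load on each pool, which this example leaves out.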
For users, the most visible impact would be in AI systems that need to inspect a large amount of material before answering. That includes code tools, video AI systems, and AI agents, all areas where, according to Nvidia, partners are evaluating the technology.
Blackwell Ultra gives Nvidia a proof point
Nvidia is using recent benchmark results to support the strategy. In MLPerf Inference v5.1, the latest round of the industry-standard benchmark suite, the company submitted Blackwell Ultra results for the first time and set new records.
Blackwell Ultra, running in the GB300 NVL72 rack system, delivered up to 45 percent higher performance per GPU than standard Blackwell, according to Nvidia. The company also says throughput on the new DeepSeek-R1 benchmark is five times higher than on the previous Hopper architecture.
The more relevant detail for Rubin CPX is that these tests already used a software version of split inference. For the Llama 3.1 405B model, Nvidia used "disaggregated serving" to divide the context and generation phases across different GPUs.
That method was managed by Nvidia's Dynamo framework. Nvidia says it boosted per-GPU throughput by nearly 1.5x compared with traditional serving methods.
In other words, Blackwell Ultra is not Rubin CPX, but Nvidia is presenting its results as evidence that separating inference into stages can pay off. Rubin CPX would turn that software pattern into a more specialized hardware path.
Who is already looking at it
Nvidia says partners including Cursor, Runway, and Magic are evaluating the technology for their own use cases. Cursor makes a code editor, Runway works on video AI, and Magic builds AI agents.
Those examples line up with the kinds of workloads Nvidia is targeting. Code tools may need to reason across an entire codebase. Video AI can involve large amounts of data. Agent systems can require broad context before deciding what to do next.
Rubin CPX is therefore best understood as part of a larger shift in AI infrastructure. As context windows grow, the cost of reading and analyzing input becomes a central challenge. Nvidia’s answer is to stop treating inference as a single hardware problem and to build a GPU specifically for the phase where massive context is processed.