MIT Tech Review | AI | July 4, 2025

Why India’s AI independence push hinges on language and compute

India is accelerating its push for AI independence after DeepSeek-R1 exposed how far it still trails the US and China. The central obstacles are not only money and GPUs, but also India’s 22 official languages, limited high-quality local-language data, and debate over whether publicly backed models should be open.

India’s ambition to build homegrown AI is moving from discussion to execution. The trigger was DeepSeek-R1, a Chinese open-source foundation model that made Indian founders and policymakers reconsider what could be achieved with focused engineering, lower capital, and speed.

The challenge is large. India has a deep software industry and major technical talent, but its AI builders face structural problems that are different from those in the US or China. The country must build models that work across many languages, on limited compute, and inside an ecosystem that has historically rewarded software services more than deep-tech invention.

DeepSeek-R1 forced a faster response

For some Indian AI founders, DeepSeek-R1 was proof that a determined team could challenge better-funded rivals. Adithya Kolavi, the 20-year-old founder of CognitiveLab in Bengaluru, saw the launch as validation that disruption did not require Western-scale resources.

For others, the moment was painful. Abhishek Upperwal, founder of Soket AI Labs, had already worked on one of India’s early foundation model efforts. His model, Pragna-1B, had 1.25 billion parameters and was designed to reduce the “language tax” created by India’s many languages. But with small grants and limited resources, the project could not scale beyond a proof of concept.

India’s broader gap is rooted in long-running underinvestment. In 2024, India’s R&D spending was 0.65% of GDP ($25.4 billion), compared with China’s 2.68% ($476.2 billion) and the US’s 3.5% ($962.3 billion). The result is a weaker pipeline for commercializing deep tech, from algorithms to chips.

Research excellence does exist inside government agencies such as DRDO (Defence Research and Development Organisation) and ISRO (Indian Space Research Organisation). But those breakthroughs rarely move into civilian or commercial use at the scale needed for a competitive AI industry.

The government moved quickly on foundation models

In January 2025, 10 days after DeepSeek-R1’s launch, the Ministry of Electronics and Information Technology (MeitY) asked for proposals for Indian foundation models. The tender invited private cloud and data-center companies to set aside GPU capacity for government-led AI research.

Providers including Jio, Yotta, E2E Networks, Tata, AWS partners, and CDAC responded. That gave MeitY access to nearly 19,000 GPUs at subsidized rates, using private infrastructure for foundational AI projects.

The response from builders was immediate. Within two weeks, MeitY had 67 proposals. By mid-March, that number had tripled. In April, the government announced plans for six large-scale models by the end of 2025, along with 18 AI applications for sectors including agriculture, education, and climate action.

The most visible move was selecting Sarvam AI to build a 70-billion-parameter model optimized for Indian languages and needs. Sarvam received access to 4,096 Nvidia H100 GPUs to train the model over six months. The company had previously released Sarvam-1, a 2-billion-parameter model trained on 10 Indian languages.
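
For a rough sense of scale, the allocation can be converted into GPU-hours. The sketch below assumes that "six months" means about 183 days and that all 4,096 GPUs run around the clock. Neither assumption comes from the reporting, so the result is an upper bound rather than a reported figure.

```python
# Back-of-envelope ceiling on GPU-hours for the allocation described above.
GPUS = 4096          # Nvidia H100 GPUs, from the article
DAYS = 183           # assumption: "six months" taken as roughly 183 days
HOURS_PER_DAY = 24   # assumption: continuous, uninterrupted use


def gpu_hours(gpus: int, days: int, hours_per_day: int = 24) -> int:
    """Return the maximum GPU-hours available over the period."""
    return gpus * days * hours_per_day


total = gpu_hours(GPUS, DAYS, HOURS_PER_DAY)
print(f"{total:,} GPU-hours (about {total / 1e6:.1f} million)")
```

Under these assumptions the ceiling is roughly 18 million GPU-hours. Real usage would be lower once scheduling gaps, restarts, and evaluation runs are counted.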

India’s language problem is also its AI problem

India’s linguistic diversity makes foundation model development unusually difficult. The country has 22 official languages, hundreds of dialects, and millions of multilingual people. Most large language models are not built around that reality.

The data problem starts online. English has large volumes of high-quality web data, while Indian languages together represent less than 1% of online content. For languages such as Bhojpuri and Kannada, the shortage of digitized, labeled, and cleaned data makes it harder to train models that reflect how people actually speak and search.

Tokenization adds another barrier. Global tokenizers can perform poorly on Indian scripts by misreading characters or skipping them. Indian languages also use complex scripts and agglutinative grammar, where words may carry many units of meaning through prefixes and suffixes. Standard tokenizers can split those words into too many pieces, making inputs larger and model responses less accurate.
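
One way to see the inflation is to look at the worst case, where a tokenizer has learned no merges for a script and falls back to one token per UTF-8 byte. The sketch below is a toy illustration under that assumption, not a model of any specific production tokenizer. The function names are hypothetical, and the sample words are chosen only for illustration.

```python
# Toy illustration: worst-case token counts under pure byte-level fallback.
# Assumption: no learned merges exist for the script, so every UTF-8 byte
# becomes its own token. Real tokenizers usually do better than this.


def byte_fallback_tokens(text: str) -> int:
    """Count tokens if each UTF-8 byte is emitted as a separate token."""
    return len(text.encode("utf-8"))


def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per Unicode code point in the text."""
    return byte_fallback_tokens(text) / len(text)


samples = {
    "English": "India",
    # "Bharat" in Devanagari, written with escapes so the code points are explicit.
    "Hindi": "\u092d\u093e\u0930\u0924",
}

for label, word in samples.items():
    print(label, word, "code points:", len(word),
          "byte tokens:", byte_fallback_tokens(word))
```

Latin letters take one byte each in UTF-8, while Devanagari code points take three. In this worst case, the four-code-point Hindi word costs 12 tokens against 5 for the five-letter English word. Custom Indic tokenizers aim to close that gap by learning merges for whole syllables and common word pieces.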

Indian builders are trying to solve that directly. Sarvam AI created OpenHathi-Hi-v0.1, an open-source Hindi language model built on Meta’s Llama 2 architecture and trained on 40 billion tokens of Hindi and related Indian-language content.

Pragna-1B also showed a path forward. Upperwal’s team trained it on 300 billion tokens for just $250,000 and used “balanced tokenization” to improve performance. Upperwal later repurposed the core technology into speech APIs for 22 Indian languages, a more immediate fit for rural users who may not be well served by English-first interfaces.

Other startups are aiming higher. Krutrim-2 is a 12-billion-parameter multilingual language model optimized for English and 22 Indian languages. Its team built a custom Indic tokenizer, optimized the training infrastructure, and designed the model for multimodal and voice-first use cases from the beginning.

Funding, talent, and openness remain unsettled

The IndiaAI Mission is the larger framework behind much of this activity. It is a $1.25 billion national initiative launched in March 2024 to build AI infrastructure and make advanced tools more accessible. Led by MeitY, it supports AI startups, especially those building foundation models in Indian languages and applying AI to health care, education, and agriculture.

Under its compute program, the government is deploying more than 18,000 GPUs, including nearly 13,000 high-end H100 chips, to selected Indian startups. The current group includes Sarvam, Soket Labs, Gnani AI, and Gan AI.

IndiaAI also plans a national multilingual data set repository, AI labs in smaller cities, and funding for deep-tech R&D. Abhishek Singh, CEO of IndiaAI and an officer with MeitY, said India’s wider deep-tech push is expected to raise around $12 billion in research and development investment over the next five years.

That figure includes approximately $162 million through the IndiaAI Mission, with about $32 million for direct startup funding. The National Quantum Mission adds another $730 million for quantum research, while the 2025-26 national budget announced a $1.2 billion Deep Tech Fund of Funds for early-stage private-sector innovation. Nearly $9.9 billion is expected from private and international sources.
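
The components can be checked against the $12 billion headline with a quick sum. The sketch below uses only the figures quoted above, in millions of US dollars. It assumes the $32 million for startups is part of the $162 million IndiaAI line rather than additional to it, which the text implies but does not state outright.

```python
# Components of the roughly $12 billion deep-tech figure, in millions of USD.
# Assumption: the $32M for direct startup funding sits inside the $162M line.
components_musd = {
    "IndiaAI Mission": 162,
    "National Quantum Mission": 730,
    "Deep Tech Fund of Funds": 1200,
    "Private and international sources": 9900,
}


def total_musd(parts: dict) -> int:
    """Sum all components, in millions of USD."""
    return sum(parts.values())


def share(parts: dict, key: str) -> float:
    """Fraction of the total contributed by one component."""
    return parts[key] / total_musd(parts)


print(f"Total: ${total_musd(components_musd) / 1000:.3f} billion")
for name in components_musd:
    print(f"{name}: {share(components_musd, name):.1%}")
```

The parts add up to $11.992 billion, consistent with "around $12 billion." More than four-fifths of the total depends on private and international money that has not yet been committed.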

But money and compute do not settle the question of access. Sarvam’s government-backed model is being built as a closed model, even though the public role in India’s AI infrastructure has raised expectations around openness. Amlan Mohanty, an AI policy specialist, argues that “True sovereignty should be rooted in openness and transparency.”

IndiaAI has chosen not to mandate a single approach, open or closed. Singh said the program does not want to dictate business models, and that the goal is strong Indian models regardless of route. That neutrality leaves the open-source debate unresolved.

The race is now about execution

IndiaAI has received more than 500 applications from startups proposing use cases in health, governance, agriculture, and other sectors. Singh said support has already been announced for Sarvam, and 10 to 12 more startups will be funded solely for foundational models.

The pressure is clear. India wants AI systems that can compete globally while reflecting its own languages, use cases, and cultural realities. The path depends on more than a single large model. It requires usable data, reliable compute access, deep research talent, patient capital, and products that work for people outside English-first digital life.

DeepSeek-R1 gave India a jolt. What comes next will show whether that urgency can become durable AI infrastructure.