Ars Technica, AI, February 27, 2025

Copilot kept surfacing private GitHub repositories

Microsoft’s Copilot AI assistant kept more than 20,000 GitHub repositories accessible after they had been switched from public to private. Lasso traced the issue to Bing’s cache and said Microsoft’s initial fix blocked human access while Copilot could still retrieve cached data.

Copilot exposed private code and secrets through cached access, creating a concrete security and control risk.

Private code does not always become private everywhere at the same time. According to findings from AI security firm Lasso, Microsoft’s Copilot AI assistant exposed the contents of more than 20,000 private GitHub repositories that had once been public, affecting repositories from more than 16,000 organizations.

The affected repositories contained material from companies such as Google, Intel, Huawei, PayPal, IBM, Tencent and Microsoft. Lasso’s research shows why developers cannot rely on changing a repository from public to private as the only response when credentials, keys or other confidential material have already been published.

What Lasso Found

Lasso discovered the behavior in the second half of 2024. After finding in January 2025 that Copilot continued to store private repositories and make them available, the company set out to measure the scale of the problem.

The repositories at issue were not always private. They had originally been posted to GitHub as public repositories and were later changed to private, often after developers realized the code contained authentication credentials or other confidential information.

Even months after that privacy change, Lasso found that the repository contents could still be accessed through Copilot. The researchers referred to these as “zombie repositories,” meaning repositories that were once public and are now private.

Lasso researchers Ophir Dror and Bar Lanyado wrote: “After realizing that any data on GitHub, even if public for just a moment, can be indexed and potentially exposed by tools like Copilot, we were struck by how easily this information could be accessed,” and said they automated the process of identifying and validating those repositories.

How Bing Cache Kept The Data Alive

Lasso traced the exposure to the cache mechanism in Bing after discovering that Microsoft was exposing one of Lasso’s own private repositories. Bing had indexed pages while they were public and did not remove the cached entries after the pages were later made private on GitHub.

That mattered because Copilot used Bing as its primary search engine. If cached repository data remained available to Bing, Copilot could retrieve it and provide it to a user who asked.

Lasso reported the issue in November 2024. Microsoft then introduced changes intended to fix the problem. Lasso confirmed that private data was no longer available through Bing cache, but the researchers later found that Copilot could still access cached material.

According to Lasso, Microsoft’s fix appeared to cut off public access to a special Bing user interface once available at cc.bingj.com. But the underlying cached pages did not appear to be fully cleared. In practice, that meant human users were blocked from one path to the cached data while Copilot still had access.

Although Bing’s cached link feature was disabled, cached pages continued to appear in search results. This indicated that the fix was a temporary patch: public access was blocked, but the underlying data had not been fully removed.
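
The difference between hiding a cache and purging it can be shown with a toy model. The sketch below is purely illustrative and does not reflect how Bing or Copilot are actually built; the class names, the `viewer_enabled` flag and the example URL are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Origin:
    """Stands in for the code host: each page has content and a visibility flag."""
    pages: dict = field(default_factory=dict)  # url -> (content, is_public)

    def fetch(self, url):
        content, is_public = self.pages[url]
        return content if is_public else None


@dataclass
class SearchCache:
    """Hypothetical search cache that snapshots pages while they are public."""
    snapshots: dict = field(default_factory=dict)
    viewer_enabled: bool = True  # the human-facing cached-page viewer

    def crawl(self, origin, url):
        content = origin.fetch(url)
        if content is not None:
            # Snapshot is stored and never rechecked in this toy model.
            self.snapshots[url] = content

    def human_view(self, url):
        # Disabling the viewer hides snapshots from people but deletes nothing.
        return self.snapshots.get(url) if self.viewer_enabled else None

    def assistant_lookup(self, url):
        # An internal consumer reads the snapshot store directly.
        return self.snapshots.get(url)

    def purge(self, url):
        # Only removing the snapshot itself closes every access path.
        self.snapshots.pop(url, None)


url = "example.invalid/org/repo"  # placeholder, not a real address
origin = Origin()
origin.pages[url] = ("TOKEN=example-not-real", True)

cache = SearchCache()
cache.crawl(origin, url)                                # indexed while public

origin.pages[url] = ("TOKEN=example-not-real", False)   # repository made private
cache.viewer_enabled = False                            # viewer-only fix

print(origin.fetch(url))            # None: the host no longer serves it
print(cache.human_view(url))        # None: people are blocked
print(cache.assistant_lookup(url))  # TOKEN=example-not-real: snapshot survives

cache.purge(url)
print(cache.assistant_lookup(url))  # None: gone only after the purge
```

In this model, blocking the viewer changes what one access path returns, while the stored snapshot is untouched until `purge` is called. That mirrors the behavior Lasso described, under the stated simplifications.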

Why Making A Repository Private Was Not Enough

The core lesson is simple: once sensitive code has been public, changing the visibility setting may not contain the exposure. Search caches and AI tools can preserve access to material that developers believe has been withdrawn.

The risk is especially serious when repositories contain secrets. The source article notes that developers frequently embed security tokens, private encryption keys and other sensitive information directly into code, despite long-standing best practices that call for handling such data through more secure methods.
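
The source does not say which more secure methods it has in mind. One common approach, assumed here for illustration, is to read secrets from the process environment at runtime so they never appear in committed code. The function name `load_secret` and the variable name `SERVICE_API_TOKEN` are hypothetical.

```python
import os


def load_secret(name: str) -> str:
    """Return a secret from the process environment, failing loudly if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing required secret: set the {name} environment variable"
        )
    return value


if __name__ == "__main__":
    # Hypothetical variable name; the real value is supplied by the deployment
    # environment or a secrets manager, not by the repository.
    try:
        token = load_secret("SERVICE_API_TOKEN")
        print("Token loaded, length:", len(token))
    except RuntimeError as err:
        print(err)
```

With this pattern the repository contains only the name of the secret, so publishing the code by mistake does not publish the credential itself.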

When that kind of information lands in a public GitHub repository, the damage can continue after the repository is made private. Lasso’s findings show that credentials exposed in that way should be treated as compromised. The only recourse identified in the source is to rotate all credentials.

That response, however, does not solve every kind of exposure. Some repositories may contain sensitive data that cannot simply be rotated like a token or key. In those cases, privacy changes and removals may still leave copies available through systems that indexed or cached the content while it was public.

The Microsoft Lawsuit Example

Lasso also found a GitHub repository that had been made private following a lawsuit Microsoft had filed. The suit alleged that the repository hosted tools specifically designed to bypass the safety and security guardrails in Microsoft’s generative AI services.

The repository was later removed from GitHub. But Lasso found that Copilot continued to make the tools available anyway. That created an unusual situation: Microsoft had worked to get material removed from GitHub, while Copilot still surfaced it from cached data.

According to the source article, Microsoft incurred legal expenses in bringing the suit, which alleged violations of laws including the Computer Fraud and Abuse Act, the Digital Millennium Copyright Act, the Lanham Act, and the Racketeer Influenced and Corrupt Organizations Act. Company lawyers succeeded in getting the tools removed, yet Copilot undermined that result by continuing to make the tools available.

What Developers Should Take From This

The practical implication is that repository visibility is only one layer of control. If a GitHub repository was public, even briefly, developers should assume that outside systems may have indexed it.

For teams, the immediate checklist is narrow but important:

Treat any exposed credentials as compromised.
Rotate all credentials that appeared in public code.
Do not assume that switching a repository to private removes cached copies.
Keep repositories private at all times when the content should not be public.
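
To act on the first two items, a team first needs a list of what was exposed. The following is a minimal sketch of a scan for a few widely recognized credential formats, intended only as a starting point for a rotation list. The pattern set is an illustrative assumption and far from exhaustive, the function names are hypothetical, and the scan covers only the current working tree; anything committed earlier and still in the repository history was also public and needs a separate history scan.

```python
import re
from pathlib import Path

# Illustrative patterns only; real scanners ship far larger rule sets.
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GitHub personal access token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "PEM private key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def find_candidate_secrets(text):
    """Return (line_number, label) pairs for lines matching a known pattern."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, label))
    return hits


def scan_tree(root):
    """Scan every readable file under root, skipping the .git directory."""
    report = {}
    for path in Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        hits = find_candidate_secrets(text)
        if hits:
            report[str(path)] = hits
    return report


if __name__ == "__main__":
    for file_name, hits in scan_tree(".").items():
        for line_no, label in hits:
            print(f"{file_name}:{line_no}: possible {label}, rotate it")
```

Every hit should be treated as a credential to rotate rather than a line to delete, since removing the text from the repository does not withdraw copies already indexed elsewhere.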

Microsoft said in an emailed statement: “It is commonly understood that large language models are often trained on publicly available information from the web. If users prefer to avoid making their content publicly available for training these models, they are encouraged to keep their repositories private at all times.”

That position puts the burden on keeping sensitive repositories private from the beginning. Lasso’s findings show why the moment of first exposure matters: once code becomes public, later attempts to hide it may not reach every cache, search path or AI assistant that has already seen it.