Wayback Copilot: Using Microsoft's Copilot to Expose Thousands of Private GitHub Repositories
In August 2024, we encountered a LinkedIn post claiming that OpenAI was training on, and exposing, data from private GitHub repositories. Given the seriousness of this claim, our research team immediately set out to investigate.
A quick search showed that the repository in question had once been public, as its pages were still indexed by Bing. However, when we tried accessing these pages directly on GitHub, we were met with a 404 error, confirming that the repository had since been made private.
This raised an important question: How was ChatGPT able to return answers based on content from a repository that was no longer publicly accessible?

To get to the bottom of this, we repeatedly queried ChatGPT for data related to this repository. Each time, it clarified that it had no direct access to the actual data and instead generated hypothetical content based on the repository’s name. We also knew that ChatGPT leverages Bing as its search engine, a fact that has been publicly documented. This mechanism explained how someone could retrieve information from a repository that was once public but later made private: the data had been indexed during its public phase and persisted in Bing’s cache.
But two questions remained: What happened to all of the actual content of the repository once it was set to private? And what other chats might be following the same pattern?
Discovering Our Own Exposed Repository
We decided to dig deeper and investigate whether any of our organizational repositories might have been exposed. A Bing search for repositories under the Lasso organization revealed that one of our own repositories was indexed. However, since the repository had been set to private, accessing it on GitHub returned a 404 error.
An internal review confirmed that it had mistakenly been public for a brief period before being made private again.

Curious about the potential fallout, we tested ChatGPT to see if it could extract data from the indexed pages.
While it sometimes acknowledged the repository’s existence (thanks to Bing’s indexing), it didn’t provide any actual data.
Then a concerning possibility struck us: if Bing had indexed it, the data might be cached somewhere, and who might be a better candidate for accessing it than Microsoft Copilot? After all, everything stays in the family.
Our suspicions were confirmed when Copilot returned actual data from the time when the repository was public, seemingly retrieving a cached snapshot of “Zombie Data” - data that users believe is private or deleted, but is still accessible.

This discovery underscored a major security issue:
- “Zombie Data”: Any information that was ever public, even for a short period, could remain accessible to and be distributed by Microsoft Copilot.
- Private Code Risks: This vulnerability is particularly dangerous for repositories that were mistakenly published as public before being secured, given the sensitive nature of the data often stored there.
The Bing Caching Mechanism
Digging deeper into Bing’s indexing system, we discovered cc.bingj.com, a Microsoft-owned domain that stores cached versions of indexed pages. Bing provides a small arrow next to search results that allows users to view cached versions of pages even if they are no longer accessible.


Scaling Up
After realizing that any data on GitHub, even if public for just a moment, can be indexed and potentially exposed by tools like Copilot, we were struck by how easily this information could be accessed. Determined to understand the full extent of the issue, we set out to automate the process of identifying “zombie repositories” (repositories that were once public and are now private) and validate our findings.

Step 1: Collecting Public Repositories
We used Google BigQuery’s githubarchive dataset, which logs all public repository activities. We extracted a list of repositories that were public at any point in 2024 and that belong to a GitHub organization:
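The exact query isn’t reproduced here, but the sketch below shows one way such a list can be pulled with the BigQuery Python client. The wildcard table naming (githubarchive.month.2024*) and the org.login filter reflect the GH Archive schema and are our assumptions, not the query we actually ran.

```python
# Minimal sketch of Step 1, assuming access to the public githubarchive
# dataset on BigQuery. Table naming and fields are assumptions based on
# the GH Archive schema, not our exact query.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT DISTINCT repo.name AS repo_name
FROM `githubarchive.month.2024*`
WHERE org.login IS NOT NULL  -- keep only repositories that belong to an organization
"""

rows = client.query(QUERY).result()
public_repos = [row.repo_name for row in rows]  # e.g. "some-org/some-repo"
print(f"Collected {len(public_repos)} repositories that were public during 2024")
```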

Step 2: Identifying Missing Repositories
We then checked each repository’s status by attempting to access it on GitHub:
- If the request returned 200 OK, the repository was still public.
- If it returned 404, the repository had either been deleted or set to private.
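A minimal sketch of this check, assuming the “org/repo” names produced in Step 1 (the sample list below is a placeholder):

```python
# Minimal sketch of Step 2. An unauthenticated GET to the public GitHub URL
# returns 200 for public repositories and 404 for repositories that were
# deleted or made private.
import requests

def is_still_public(repo_name: str) -> bool:
    """repo_name is in 'org/repo' form, as produced by the Step 1 query."""
    resp = requests.get(f"https://github.com/{repo_name}", timeout=10)
    return resp.status_code == 200

# Hypothetical input from Step 1
public_in_2024 = ["some-org/some-repo", "another-org/another-repo"]
zombie_repos = [r for r in public_in_2024 if not is_still_public(r)]
```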
Step 3: Extracting Cached Data from Bing
For each repository that was no longer public, we searched Bing using:

and extracted any cached pages we found.
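The sketch below shows how such a lookup could be scripted against Bing’s Web Search API. The site:github.com/&lt;org&gt;/&lt;repo&gt; query and the cachedPageUrl field are assumptions about how indexed pages and their cached snapshots can be enumerated, not the exact tooling we used.

```python
# Minimal sketch of Step 3, assuming a Bing Web Search API v7 key.
# The query form and the cachedPageUrl field are assumptions.
import requests

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"
BING_KEY = "YOUR_BING_API_KEY"  # hypothetical placeholder

def cached_pages_for(repo_name: str) -> list[dict]:
    resp = requests.get(
        BING_ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": BING_KEY},
        params={"q": f"site:github.com/{repo_name}", "count": 50},
        timeout=10,
    )
    resp.raise_for_status()
    pages = resp.json().get("webPages", {}).get("value", [])
    # Keep only results that still expose a cached snapshot (cc.bingj.com)
    return [p for p in pages if p.get("cachedPageUrl")]
```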
Step 4: Scanning the content
Using internal tools, we scanned the extracted data for secrets, tokens, access keys, and packages that are not publicly available (a potential vector for dependency confusion).
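As a rough illustration, this kind of scan can be approximated with a handful of regular expressions; the patterns below are common token shapes (GitHub personal access tokens, Google API keys, generic key/value secrets), not the rule set of our internal tools.

```python
# Illustrative secret scan for Step 4. Patterns are examples of common
# token formats, not the patterns used by our internal tooling.
import re

SECRET_PATTERNS = {
    "github_token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b"),
    "generic_secret": re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
}

def scan_for_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern name, matched string) pairs found in a cached page."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, m.group(0)) for m in pattern.finditer(text))
    return hits
```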
Research Findings
- 20,580 GitHub repositories were extracted during the research using Bing’s caching mechanism
- 16,290 organizations were affected by Wayback Copilot, including Microsoft itself, Google, Intel, Huawei, PayPal, IBM, Tencent and more
- 100+ Python and Node.js internal packages that could be vulnerable to dependency confusion were discovered
- 300+ private tokens, keys & secrets to GitHub, Hugging Face, GCP, OpenAI, etc. were exposed
Reporting & Fixing
We informed Microsoft of our findings, outlining that deleted or private GitHub repositories were still accessible through Bing’s cache and Copilot and that this functionality posed a security risk by exposing organizational data. We also alerted all affected organizations and advised them to rotate or revoke any compromised keys.
Microsoft classified the issue as low severity, citing its "low impact" and asserting that this caching behavior was acceptable. Nevertheless, they responded quickly: within two weeks, Bing’s cached link feature was removed, and access to the cc.bingj.com domain was disabled for all users.
Back to Where We Started
Although Bing’s cached link feature was disabled, cached pages continued to appear in search results. This indicated that the fix was a temporary patch: while public access was blocked, the underlying data had not been fully removed.
When we revisited our investigation of Microsoft Copilot, our suspicions were confirmed: Copilot still had access to cached data that was no longer available to human users. In short, the fix was only partial - human users were prevented from retrieving the cached data, but Copilot could still access it.
Malicious use of Wayback Copilot
On January 14, 2025 (after the Microsoft fix), we came across another opportunity to test Wayback Copilot. In an article on TechCrunch, we learned that “Microsoft has taken legal action against a group the company claims intentionally developed and used tools to bypass the safety guardrails of its cloud AI products”. The article mentioned that the tool’s repositories had been removed from GitHub. We got curious about the content of these repositories and decided to chat with our good ol’ friend. Although the repositories were removed (the famous GitHub 404), we were able to retrieve the content of the malicious package.


Key Takeaways from Our Research
In today’s digital landscape, data remains a valuable asset, and the explosion of LLMs and generative AI has intensified the race for unique datasets. This pressure is pushing organizations toward riskier data acquisition strategies - even resorting to dark patterns - to gain an edge in training quality and answer accuracy. Here are our primary insights:
1. Assume All Data Is Compromised Once Exposed
Modern organizations must operate under the assumption that any data leaving their network, even if public only momentarily, is likely to be ingested by LLM engines or major search engines for future training purposes. It’s more critical than ever to protect and sanitize outgoing data streams. Control every bit of data that exits your perimeter.
2. LLM Engines as New Threat Vectors
Traditional threat intelligence has focused on scanning the web, forums, or the dark web for leaked data. Now, LLMs and AI copilots introduce a new risk: with just a single prompt, sensitive data could potentially be exposed to a wide audience. This represents a significant shift in how data breaches might occur.
3. Permission Pitfalls and Overly Helpful Systems
Our findings reveal two well-known challenges with LLMs and retrieval-augmented generation (RAG) systems: managing permissions and the systems’ inherent eagerness to assist. Combined with certain design choices (Microsoft’s, in our case), these issues can lead to oversharing sensitive information. When deploying or developing your own systems, ensure that users can only access data they’re authorized to see and that you maintain full awareness of what’s being indexed and retrieved.
4. Back to Basics: Fundamental Data Hygiene
Despite the evolving landscape, the basics of cybersecurity remain unchanged. Always verify that your repositories are private, avoid publishing secrets or tokens on platforms like GitHub, and mitigate risks such as dependency confusion by using official package repositories (e.g., PyPI, npm) for private packages. In essence, safeguard your private information and code by keeping them within the organizational boundaries.
If you want to know whether or not your organization was impacted, please reach out to research@lasso.security.