Wayback Copilot: Using Microsoft's Copilot to Expose Thousands of Private GitHub Repositories
In August 2024, we encountered a LinkedIn post claiming that OpenAI was training on, and exposing, data from private GitHub repositories. Given the seriousness of this claim, our research team immediately set out to investigate.
A quick search showed that the repository in question had once been public, as its pages were still indexed by Bing. However, when we tried accessing these pages directly on GitHub, we were met with a 404 error, confirming that the repository had since been made private.
This raised an important question: How was ChatGPT able to return answers based on content from a repository that was no longer publicly accessible?

To get to the bottom of this, we repeatedly queried ChatGPT for data related to this repository. Each time, it clarified that it had no direct access to the actual data and instead generated hypothetical content based on the repository’s name. We also knew that ChatGPT leverages Bing as its search engine, a fact that has been publicly documented. This mechanism explained how someone could retrieve information from a repository that was once public but later made private: the data had been indexed during its public phase and persisted in Bing’s cache.
But two questions remained: What happened to all of the actual content of the repository once it was set to private? And what other chats might be following the same pattern?
Discovering Our Own Exposed Repository
We decided to dig deeper and investigate whether any of our organizational repositories might have been exposed. A Bing search for repositories under the Lasso organization revealed that one of our own repositories was indexed. However, since the repository had been set to private, accessing it on GitHub returned a 404 error.
An internal review confirmed that it had mistakenly been public for a brief period before being made private again.

Curious about the potential fallout, we tested ChatGPT to see if it could extract data from the indexed pages.
While it sometimes acknowledged the repository’s existence (thanks to Bing’s indexing), it didn’t provide any actual data.
Then a concerning possibility struck us: if Bing had indexed it, the data might be cached somewhere, and who might be a better candidate for accessing it than Microsoft Copilot? After all, everything stays in the family.
Our suspicions were confirmed when Copilot returned actual data from the time when the repository was public, seemingly retrieving a cached snapshot of “Zombie Data” - data that users believe is private or deleted, but is still accessible.

This discovery underscored a major security issue:
- “Zombie Data”: Any information that was ever public, even for a short period, could remain accessible to and be distributed by Microsoft Copilot.
- Private Code Risks: This vulnerability is particularly dangerous for repositories that were mistakenly published as public before being secured, given the sensitive nature of the data often stored there.
The Bing Caching Mechanism
Digging deeper into Bing’s indexing system, we discovered cc.bingj.com, a Microsoft-owned domain that stores cached versions of indexed pages. Bing provides a small arrow next to search results that allows users to view cached versions of pages even if they are no longer accessible.


Scaling Up
After realizing that any data on GitHub, even if public for just a moment, can be indexed and potentially exposed by tools like Copilot, we were struck by how easily this information could be accessed. Determined to understand the full extent of the issue, we set out to automate the process of identifying “zombie repositories” (repositories that were once public and are now private) and validate our findings.

Step 1: Collecting Public Repositories
We used Google BigQuery’s githubarchive dataset, which logs all public repository activities. We extracted a list of repositories that were public at any point in 2024 and that belong to a GitHub organization:
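The exact query isn’t reproduced here, but the sketch below shows one way such a list can be pulled with the BigQuery Python client. The wildcard table naming (githubarchive.month.2024*) and the org.login filter reflect the GH Archive schema and are our assumptions, not the query we actually ran.

```python
# Minimal sketch of Step 1, assuming access to the public githubarchive
# dataset on BigQuery. Table naming and fields are assumptions based on
# the GH Archive schema, not our exact query.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT DISTINCT repo.name AS repo_name
FROM `githubarchive.month.2024*`
WHERE org.login IS NOT NULL  -- keep only repositories that belong to an organization
"""

rows = client.query(QUERY).result()
public_repos = [row.repo_name for row in rows]  # e.g. "some-org/some-repo"
print(f"Collected {len(public_repos)} repositories that were public during 2024")
```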

Step 2: Identifying Missing Repositories
We then checked each repository’s status by attempting to access it on GitHub:
- If the request returned 200 OK, the repository was still public.
- If it returned 404, the repository had either been deleted or set to private.
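A minimal sketch of this check, assuming the “org/repo” names produced in Step 1 (the sample list below is a placeholder):

```python
# Minimal sketch of Step 2. An unauthenticated GET to the public GitHub URL
# returns 200 for public repositories and 404 for repositories that were
# deleted or made private.
import requests

def is_still_public(repo_name: str) -> bool:
    """repo_name is in 'org/repo' form, as produced by the Step 1 query."""
    resp = requests.get(f"https://github.com/{repo_name}", timeout=10)
    return resp.status_code == 200

# Hypothetical input from Step 1
public_in_2024 = ["some-org/some-repo", "another-org/another-repo"]
zombie_repos = [r for r in public_in_2024 if not is_still_public(r)]
```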
Step 3: Extracting Cached Data from Bing
For each repository that was no longer public, we searched Bing using:

and extracted any cached pages we found.
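The sketch below shows how such a lookup could be scripted against Bing’s Web Search API. The site:github.com/&lt;org&gt;/&lt;repo&gt; query and the cachedPageUrl field are assumptions about how indexed pages and their cached snapshots can be enumerated, not the exact tooling we used.

```python
# Minimal sketch of Step 3, assuming a Bing Web Search API v7 key.
# The query form and the cachedPageUrl field are assumptions.
import requests

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"
BING_KEY = "YOUR_BING_API_KEY"  # hypothetical placeholder

def cached_pages_for(repo_name: str) -> list[dict]:
    resp = requests.get(
        BING_ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": BING_KEY},
        params={"q": f"site:github.com/{repo_name}", "count": 50},
        timeout=10,
    )
    resp.raise_for_status()
    pages = resp.json().get("webPages", {}).get("value", [])
    # Keep only results that still expose a cached snapshot (cc.bingj.com)
    return [p for p in pages if p.get("cachedPageUrl")]
```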
Step 4: Scanning the content
Using internal tools, we scanned the extracted data for secrets, tokens, access keys, and packages that are not publicly available (a potential vector for dependency confusion).
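As a rough illustration, this kind of scan can be approximated with a handful of regular expressions; the patterns below are common token shapes (GitHub personal access tokens, Google API keys, generic key/value secrets), not the rule set of our internal tools.

```python
# Illustrative secret scan for Step 4. Patterns are examples of common
# token formats, not the patterns used by our internal tooling.
import re

SECRET_PATTERNS = {
    "github_token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b"),
    "generic_secret": re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
}

def scan_for_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern name, matched string) pairs found in a cached page."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, m.group(0)) for m in pattern.finditer(text))
    return hits
```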
Research Findings
- 20,580 GitHub repositories were extracted during the research using Bing’s caching mechanism
- 16,290 organizations were affected by Wayback Copilot, including Microsoft itself, Google, Intel, Huawei, PayPal, IBM, Tencent and more
- 100+ Python and Node.js internal packages that could be vulnerable to dependency confusion were discovered
- 300+ private tokens, keys & secrets to GitHub, Hugging Face, GCP, OpenAI, etc. were exposed
Reporting & Fixing
We informed Microsoft of our findings, outlining that deleted or private GitHub repositories were still accessible through Bing’s cache and Copilot and that this functionality posed a security risk by exposing organizational data. We also alerted all affected organizations and advised them to rotate or revoke any compromised keys.
Microsoft classified the issue as low severity, citing its "low impact" and asserting that this caching behavior was acceptable. Nevertheless, they responded quickly: within two weeks, Bing’s cached link feature was removed, and access to the cc.bingj.com domain was disabled for all users.
Back to Where We Started
Although Bing’s cached link feature was disabled, cached pages continued to appear in search results. This indicated that the fix was a temporary patch: while public access was blocked, the underlying data had not been fully removed.
When we revisited our investigation of Microsoft Copilot, our suspicions were confirmed: Copilot still had access to cached data that was no longer available to human users. In short, the fix was only partial - human users were prevented from retrieving the cached data, but Copilot could still access it.
Malicious use of Wayback Copilot
On January 14, 2025 (after the Microsoft fix), we came across another opportunity to test Wayback Copilot. In an article on TechCrunch, we learned that “Microsoft has taken legal action against a group the company claims intentionally developed and used tools to bypass the safety guardrails of its cloud AI products”. The article mentioned that the tool’s repositories had been removed from GitHub. We got curious about the content of these repositories and decided to chat with our good ol’ friend. Although the repositories were removed (the famous GitHub 404), we were able to retrieve the content of the malicious package.


Key Takeaways from Our Research
In today’s digital landscape, data remains a valuable asset, and the explosion of LLMs and generative AI has intensified the race for unique datasets. This pressure is pushing organizations toward riskier data acquisition strategies - even resorting to dark patterns - to gain an edge in training quality and answer accuracy. Here are our primary insights:
1. Assume All Data Is Compromised Once Exposed
Modern organizations must operate under the assumption that any data leaving their network, even if public only momentarily, is likely to be ingested by LLM engines or major search engines for future training purposes. It’s more critical than ever to protect and sanitize outgoing data streams. Control every bit of data that exits your perimeter.
2. LLM Engines as New Threat Vectors
Traditional threat intelligence has focused on scanning the web, forums, or the dark web for leaked data. Now, LLMs and AI copilots introduce a new risk: with just a single prompt, sensitive data could potentially be exposed to a wide audience. This represents a significant shift in how data breaches might occur.
3. Permission Pitfalls and Overly Helpful Systems
Our findings reveal two well-known challenges with LLMs and retrieval-augmented generation (RAG) systems: managing permissions and the systems’ inherent eagerness to assist. Combined with certain design choices (Microsoft’s, in our case), these issues can lead to oversharing sensitive information. When deploying or developing your own systems, ensure that users can only access data they’re authorized to see and that you maintain full awareness of what’s being indexed and retrieved.
4. Back to Basics: Fundamental Data Hygiene
Despite the evolving landscape, the basics of cybersecurity remain unchanged. Always verify that your repositories are private, avoid publishing secrets or tokens on platforms like GitHub, and mitigate risks such as dependency confusion by using official package repositories (e.g., PyPI, npm) for private packages. In essence, safeguard your private information and code by keeping them within the organizational boundaries.
If you want to know whether or not your organization was impacted, please reach out to research@lasso.security.