When Cloudflare, a leading internet security and performance company, attempted to respond to a phishing threat hosted on its R2 object storage platform, it inadvertently caused a 59-minute disruption. The incident was precipitated by a well-intentioned but flawed response to an abuse report: instead of blocking only the specific endpoint associated with the phishing activity, a Cloudflare employee mistakenly disabled the entire R2 Gateway service, leading to widespread outages across several Cloudflare services.
The Root Cause of the Outage
Mistake in Blocking a Phishing URL
On the day of the incident, Cloudflare received an abuse report indicating phishing activity hosted within its R2 object storage platform. Acting to resolve the report promptly, a Cloudflare employee attempted to block the specific endpoint involved. Instead of isolating that endpoint, however, the action unintentionally disabled the entire R2 Gateway service, the API front end for R2. This error triggered a cascade of failures across numerous Cloudflare services.
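To make the difference in blast radius concrete, here is a minimal sketch of the two kinds of remediation action. This is not Cloudflare's internal tooling; the names AbuseReport, RemediationApi, blockUrl, and disableService are invented for illustration.

```typescript
// Hypothetical remediation actions, for illustration only.
type AbuseReport = { reportedUrl: string; bucket: string };

interface RemediationApi {
  blockUrl(url: string): Promise<void>;        // narrow: affects one endpoint
  disableService(name: string): Promise<void>; // broad: affects every tenant of the service
}

async function remediatePhishing(api: RemediationApi, report: AbuseReport): Promise<void> {
  // Correct scope: block only the endpoint named in the abuse report.
  await api.blockUrl(report.reportedUrl);

  // The mistake in this incident was equivalent to reaching for the broad
  // action instead, which takes R2 offline for every customer:
  // await api.disableService("r2-gateway");
}
```

The two calls sit side by side in this sketch precisely because that proximity is the hazard: when a product-wide switch is reachable from the same workflow as a scoped block, a single mis-click can convert an abuse ticket into a platform outage.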
The disruption was not confined to R2 Object Storage. Because many Cloudflare products depend on R2, the failure cascaded into Stream, Images, Cache Reserve, Vectorize, and other services, as detailed in the next section. This chain of failures underscored the consequences of failing to adequately isolate high-risk actions.
Impact on Various Services
The outage had a profound impact across a range of Cloudflare services. The R2 Object Storage system itself suffered significant downtime, halting operations for businesses that rely on it for critical storage. Stream experienced a 100% failure rate, preventing users from uploading or streaming video content. Cloudflare's Images service was completely interrupted, blocking the image uploads and downloads that many businesses depend on for visual content delivery.
Cache Reserve, another essential service, encountered operational disruptions that pushed traffic back to customer origins, increasing origin load. Vectorize saw 75% of queries fail and complete failures of insert, upsert, and delete operations, further underscoring the breadth of the outage. Cloudflare's Log Delivery service also suffered data loss, with up to 13.6% of logs lost for R2-related jobs and up to 4.5% for non-R2 jobs. Even Durable Objects, Cache Purge, and Workers & Pages did not escape unscathed, each experiencing partial failures that noticeably degraded their functionality.
Steps Taken in Response
Corrective Measures
In response to this significant outage, Cloudflare took immediate corrective measures to mitigate the risk of recurrence. One of the first actions involved removing the ability to disable critical services through the abuse review interface, thereby reducing the likelihood of accidental service disablement. Additionally, Cloudflare implemented tighter restrictions within the Admin API to prevent the unintentional disabling of services, particularly in internal accounts. By tightening these controls, Cloudflare aimed to enhance the security and reliability of its internal systems.
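A minimal sketch of the kind of guard these measures describe is shown below, assuming a hypothetical admin-API validation step; PROTECTED_SERVICES, DisableRequest, and validateDisableRequest are invented names, not Cloudflare APIs.

```typescript
// Illustrative guard: refuse product-level disables from abuse-review tooling
// and block service disables on internal accounts entirely.
const PROTECTED_SERVICES = new Set(["r2-gateway", "stream", "images"]);

interface DisableRequest {
  service: string;
  accountType: "internal" | "customer";
  reason: string;
}

function validateDisableRequest(req: DisableRequest): void {
  // Product-wide kill switches should not be reachable from abuse remediation.
  if (PROTECTED_SERVICES.has(req.service)) {
    throw new Error(`Refusing to disable protected service: ${req.service}`);
  }
  // Internal accounts host shared infrastructure, so disables are rejected outright.
  if (req.accountType === "internal") {
    throw new Error("Disabling services on internal accounts is not permitted");
  }
}
```

The design intent is simply to make the dangerous path fail closed: the validation runs before any state change, so an operator acting on an abuse report can only reach the narrowly scoped actions.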
Furthermore, the company undertook broader improvements to its account provisioning processes, introducing stricter access controls so that only qualified personnel hold the permissions needed to execute high-impact actions. Another critical step was the introduction of a two-party approval mechanism for high-risk actions, ensuring that no significant change can be made unilaterally. This provides an additional layer of oversight and verification, reducing the risk of human error and strengthening the robustness of Cloudflare's service delivery.
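The following is a minimal sketch of what a two-party approval gate can look like in code, under the assumption of a simple action-queue model; HighRiskAction, Approval, and runWithTwoPartyApproval are hypothetical names used only for this example.

```typescript
// Illustrative two-party approval gate for high-risk actions.
interface Approval {
  approver: string;
  approvedAt: Date;
}

interface HighRiskAction {
  id: string;
  requestedBy: string;
  approvals: Approval[];
  execute(): Promise<void>;
}

async function runWithTwoPartyApproval(action: HighRiskAction): Promise<void> {
  // Only approvals from someone other than the requester count.
  const secondParties = new Set(
    action.approvals
      .map((a) => a.approver)
      .filter((name) => name !== action.requestedBy)
  );

  if (secondParties.size < 1) {
    throw new Error(`Action ${action.id} requires sign-off from a second person`);
  }
  await action.execute();
}
```

The key property is that the requester's own approval never satisfies the check, so a single person, however well intentioned, cannot trigger a high-impact change alone.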
Lessons Learned and Future Improvements
The incident offers a clear lesson in the cost of imprecise remediation. Cloudflare intended to respond to the reported phishing activity rigorously, but disabling the whole R2 Gateway service instead of just the malicious endpoint had far-reaching consequences, disrupting multiple dependent services for 59 minutes. Because Cloudflare's operations are critical to internet security and performance, the outage underscores the importance of precise, carefully scoped actions when handling abuse reports, and the corrective measures described above are aimed at making such mistakes far harder to repeat.