Almost nine years ago, Cloudflare was a tiny company and I was a customer, not an employee. Cloudflare had launched a month earlier, and one day an alert told me that my little site, jgc.org, no longer seemed to have working DNS. Cloudflare had pushed out a change to its use of Protocol Buffers, and it had broken DNS.
I wrote directly to Matthew Prince with an email titled “Where’s my dns?” and he replied with a long, detailed, technical response (you can read the full email exchange here), to which I replied:
From: John Graham-Cumming
Date: Thu, Oct 7, 2010 at 9:14 AM
Subject: Re: Where's my dns?
To: Matthew Prince
Awesome report, thanks. I'll make sure to call you if there's a
problem. At some point it would probably be good to write this up as a
blog post when you have all the technical details, because I think
people really appreciate openness and honesty about these sorts of
things. Especially if you couple it with charts showing your
post-launch traffic increase.
I have robust monitoring of my sites that sends me a Continue reading
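About nine years ago, Cloudflare was a tiny company and I was a customer, not an employee. Cloudflare had launched a month earlier, and one day an alert told me that my little site, jgc.org, no longer seemed to have working DNS. Cloudflare had pushed out a change to its use of Protocol Buffers, and it had broken DNS.
I wrote directly to Matthew Prince with an email titled “Where’s my dns?” and he replied with a long, detailed, technical response (you can read the full email exchange here), to which I replied: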
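From: John Graham-Cumming
Date: Thu, Oct 7, 2010 at 9:14 AM
Subject: Re: Where's my dns?
To: Matthew Prince
Awesome report, thanks. I'll make sure to call you if there's a
problem. At some point it would probably be good to write this up as a
blog post when you have all the technical details, because I think
people really appreciate openness and honesty about these sorts of
things. Especially if you couple it with charts showing your
post-launch traffic increase.
I have pretty robust monitoring of my sites, so I get an SMS when
anything fails. Monitoring shows my site was down from 13:03:07 to
14:04:12. Tests are run every five minutes.
It was a blip that I'm sure you'll get past. But are you sure you don't
need someone in Europe? :-)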
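To which he replied:
From: Matthew Prince
Date: Thu, Oct 7, 2010 at 9:57 AM
Subject: Re: Where's my dns?
To: John Graham-Cumming
Thanks. We've written back to everyone who wrote in. I'm headed into
the office now and we'll put something on the blog or pin an official
post to the top of our bulletin board system. I agree 100%
transparency is best.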
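And so, today, as an employee of a much larger Cloudflare, I get to write a post that sets out clearly the mistake we made, its impact, and what we are doing about it.
On July 2, we deployed a new rule in our WAF Managed Rules that exhausted the CPU on every core that handles HTTP/HTTPS traffic across the Cloudflare network worldwide. We are constantly improving the WAF Managed Rules to respond to new vulnerabilities and threats. In May, for example, we used the speed with which we can update the WAF to push out a rule protecting against a serious SharePoint vulnerability. Being able to deploy rules quickly and globally is a critical feature of our WAF.
Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted the CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows the CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.
This resulted in our customers (and their customers) seeing a 502 error page when visiting any Cloudflare domain. The 502 errors were generated by the front-line Cloudflare web servers, which still had CPU cores available but were unable to reach the processes that serve HTTP/HTTPS traffic.
We know how much this hurt our customers. We’re ashamed it happened. It also had a negative impact on our own operations while we were dealing with the incident.
If you were one of our customers, it must have been incredibly stressful, frustrating and frightening. It was all the more upsetting because it ended a six-year run without a global outage.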
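The CPU exhaustion was caused by a single WAF rule containing a poorly written regular expression that ended up backtracking excessively. The regular expression at the heart of the outage is (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))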
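Although the regular expression itself has become the focus of attention for many people (and is discussed in detail below), the real story of how the Cloudflare service went down for 27 minutes is much more complex than “a regular expression went bad”. We’ve taken the time to write out the series of events that led to the outage and kept us from responding quickly. If you want to know more about regular expression backtracking and what to do about it, you’ll find it in the appendix at the end of this post.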
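To make the backtracking concrete, here is a minimal sketch in Python. It is not Cloudflare’s WAF engine or any real regex library: `count_steps` is a hypothetical toy matcher, invented for this illustration, that understands only literal characters and a greedy `.*`. It counts how many match attempts a naive backtracking strategy makes for the shape `.*.*=.*`, which is a simplification of the `.*(?:.*=.*)` tail of the expression above.

```python
def count_steps(tokens, text):
    """Match `tokens` against `text` from position 0, counting attempts.

    `tokens` is a list whose items are either the string ".*" (a greedy
    wildcard) or a single literal character. This is a toy model of a
    backtracking matcher, not a real regex engine.
    """
    steps = 0

    def match_at(ti, pos):
        nonlocal steps
        steps += 1
        if ti == len(tokens):
            return True  # every token consumed: overall match
        tok = tokens[ti]
        if tok == ".*":
            # Greedy: try the longest candidate first, then back off
            # one character at a time.
            for end in range(len(text), pos - 1, -1):
                if match_at(ti + 1, end):
                    return True
            return False
        if pos < len(text) and text[pos] == tok:
            return match_at(ti + 1, pos + 1)
        return False

    return match_at(0, 0), steps


PATTERN = [".*", ".*", "=", ".*"]  # toy form of .*.*=.*


def steps_for(n):
    """Attempts needed for the input 'x=' followed by n 'x' characters."""
    matched, steps = count_steps(PATTERN, "x=" + "x" * n)
    assert matched
    return steps


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        print(n, steps_for(n))
```

In this toy model, doubling the length of the input roughly quadruples the number of attempts, because the two adjacent `.*` tokens are tried in every combination of split points before the `=` is found. That quadratic growth on ordinary-looking input is the kind of behaviour that can turn a single rule into a CPU problem.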
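Let’s go through events in the order they happened. All times in this blog are UTC.
At 13:42 an engineer on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. This generated a Change Request ticket. We use Jira to manage these tickets and a screenshot is below.
Three minutes later the first PagerDuty page went out, indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF from outside Cloudflare (we run hundreds of such tests) to ensure that it is working correctly. It was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors, and then many reports of CPU exhaustion from our points of presence (PoPs) in cities worldwide.
Some of these alerts reached me and I jumped up and left the meeting room. I was on my way back to my desk when a leader in our Solutions Engineering group told me we had lost 80% of our traffic. I ran over to the SRE team, who were debugging the situation. In the initial moments of the outage there was speculation that it was an attack of some type we’d never seen before.
Cloudflare’s SRE team is distributed around the world and monitors the network around the clock. The vast majority of alerts like these point to very specific issues of limited scope in localized areas; they are monitored in internal dashboards and dealt with many times every day. This pattern of pages and alerts, however, indicated that something grave had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.
The London engineering team was at that moment in our main event space listening to an internal tech talk. The talk was interrupted and everyone assembled in a large conference room, discussing the problem or dialing in. This wasn’t a normal problem that SRE could handle alone; it needed every relevant team online at once.
At 14:00 the WAF was identified as the component causing the problem and an attack was ruled out. The Performance Team, from a machine that clearly showed the WAF Continue reading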
About nine years ago, Cloudflare was still a tiny company and I was a customer, not an employee. Cloudflare had existed for only a month. One day I was notified that DNS was no longer working for my little site, jgc.org. Cloudflare had changed its use of Protocol Buffers, and that had broken the DNS service.
I sent an email titled “Where’s my dns?” directly to Matthew Prince, and he responded with a long, detailed, technical explanation (you can read the full email exchange here), to which I replied:
From: John Graham-Cumming
Date: Thu, Oct 7, 2010 at 9:14 AM
Subject: Re: Where's my dns?
To: Matthew Prince
Awesome report, thanks. I'll make sure to call you if there's a
problem. At some point it would probably be good to write this up as a
blog post when you have all the technical details, because I think
people really appreciate openness and honesty about these sorts of
things. Especially if you couple it with charts showing your
post-launch traffic increase.
I have pretty robust monitoring of my sites, so I get an SMS when
anything fails. My data shows Continue reading
To learn more about the origins of The Network is the Computer®, I spoke with John Gage, the creator of the phrase and the 21st employee of Sun Microsystems. John had a key role in shaping the vision of Sun and had a lot to share about his vision for the future. Listen to our conversation here and read the full transcript below.
John Graham-Cumming: I’m talking to John Gage who was what, the 21st employee of Sun Microsystems, which is what Wikipedia claims and it also claims that you created this phrase “The Network is the Computer,” and that's actually one of the things I want to talk about with you a little bit because I remember when I was in Silicon Valley seeing that slogan plastered about the place and not quite understanding what it meant. So do you want to tell me what you meant by it or what Sun meant by it at the time?
John Gage: Well, in 2019, recalling what it meant in 1982 or ’83 will be colored by all our experience since then, but at the time it seemed so obvious that when we introduced the first scientific workstations, they Continue reading
Last week I spoke with Ray Rothrock, former Director of CAD/CAM Marketing at Sun Microsystems, to discuss his time at Sun and how the Internet has evolved. In this conversation, Ray discusses the importance of trust as a principle, the growth of Sun in sales and marketing, and that time he gave Vice President Bush a Sun demo. Listen to our conversation here and read the full transcript below.
John Graham-Cumming: Here I am very lucky to get to talk with Ray Rothrock who was I think one of the first investors in Cloudflare, a Series A investor and got the company a little bit of money to get going, but if we dial back a few earlier years than that, he was also at Sun as the Director of CAD/CAM Marketing. There is a link between Sun and Cloudflare. At least one, but probably more than one, which is that Cloudflare has recently trademarked, “The Network is the Computer”. And that was a Sun trademark, wasn’t it?
Ray Rothrock: It was, yes.
Graham-Cumming: I talked to John Gage and I asked him about this as well and I asked him to explain to me what it Continue reading
I spoke with Greg Papadopoulos, former CTO of Sun Microsystems, to discuss the origins and meaning of The Network is the Computer®, as well as Cloudflare’s role in the evolution of the phrase. During our conversation, we considered the inevitability of latency, the slowness of the speed of light, and the future of Cloudflare’s newly acquired trademark. Listen to our conversation here and read the full transcript below.
John Graham-Cumming: Thank you so much for taking the time to chat with me. I've got Greg Papadopoulos who was CTO of Sun and is currently a venture capitalist. Tell us about “The Network is the Computer.”
Greg Papadopoulos: Well, from certainly a Sun perspective, the very first Sun-1 was connected via Internet protocols and at that time there was a big war about what should win from a networking point of view. And there was a dedication there that everything that we made was going to interoperate on the network over open standards, and from day one in the company, it was always that thought. It's really about the collection of these machines and how they interact with one another, and of course that puts the network in Continue reading
We recently registered the trademark for The Network is the Computer®, to encompass how Cloudflare is utilizing its network to pave the way for the future of the Internet.
The phrase was first coined in 1984 by John Gage, the 21st employee of Sun Microsystems, where he was credited with building Sun’s vision around “The Network is the Computer.” When Sun was acquired in 2010, the trademark was not renewed, but the vision remained.
Take it from him:
“When we built Sun Microsystems, every computer we made had the network at its core. But we could only imagine, over thirty years ago, today’s billions of networked devices, from the smallest camera or light bulb to the largest supercomputer, sharing their packets across Cloudflare’s distributed global network.
We based our vision of an interconnected world on open and shared standards. Cloudflare extends this dedication to new levels by openly sharing designs for security and resilience in the post-quantum computer world.
Most importantly, Cloudflare is committed to immediate, open, transparent accountability for network performance. I’m a dedicated reader of their technical blog, as the network becomes central to our security infrastructure and the global economy, demanding even more powerful technical innovation. Continue reading
Cloudflare came out strong, pointing the finger at Verizon for shoddy practices that put the Internet at risk. It didn’t take long for karma to come around and for Cloudflare to have its own Internet-impacting outage, caused by a mistake of its own. In this episode we talk about that outage, the risk of centralization on the Internet, managing MSPs when trouble strikes, and whether or not agile processes are forgoing security in favor of faster releases.
Outro Music:
Danger Storm Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/
The post Cloudflare’s Karma, Managing MSPs, & Agile Security appeared first on Network Collective.
For some time I’ve wanted to play with coverage-guided fuzzing. Fuzzing is a powerful testing technique where an automated program feeds semi-random inputs to a tested program. The intention is to find such inputs that trigger bugs. Fuzzing is especially useful in finding memory corruption bugs in C or C++ programs.
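As a rough illustration of the coverage-guided idea, here is a minimal sketch in Python. Everything in it is hypothetical: `target` is a toy parser invented for this example, with a planted bug behind four nested checks, and `fuzz` is a deliberately naive loop. Coverage is recorded by hand here, whereas real fuzzers typically obtain it by instrumenting the tested program.

```python
import random


def target(data: bytes, coverage: set) -> None:
    """Toy parser with a planted bug behind four nested byte checks."""
    if len(data) > 0 and data[0] == ord("F"):
        coverage.add("F")
        if len(data) > 1 and data[1] == ord("U"):
            coverage.add("FU")
            if len(data) > 2 and data[2] == ord("Z"):
                coverage.add("FUZ")
                if len(data) > 3 and data[3] == ord("Z"):
                    raise ValueError("planted bug reached")


def fuzz(seed: bytes, rounds: int, rng: random.Random):
    """Mutate corpus entries; keep any mutant that reaches new branches.

    `seed` must be non-empty. Returns the first input that triggers the
    planted bug, or None if the round budget runs out first.
    """
    corpus = [seed]
    seen = set()
    for _ in range(rounds):
        # Pick a known-interesting input and overwrite one random byte.
        child = bytearray(rng.choice(corpus))
        child[rng.randrange(len(child))] = rng.randrange(256)
        coverage = set()
        try:
            target(bytes(child), coverage)
        except ValueError:
            return bytes(child)
        if not coverage <= seen:  # new branch reached: keep this input
            seen |= coverage
            corpus.append(bytes(child))
    return None


if __name__ == "__main__":
    print(fuzz(b"AAAA", rounds=200_000, rng=random.Random(0)))
```

The coverage feedback is what makes this practical: each mutant that gets one check deeper is kept as a new starting point, so the loop can discover the four bytes one at a time instead of having to guess all four at once.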
Normally it's recommended to pick a well-known, but little explored, library that is heavy on parsing. Historically things like libjpeg, libpng and libyaml were perfect targets. Nowadays it's harder to find a good target - everything seems to have been fuzzed to death already. That's a good thing! I guess the software is getting better! Instead of choosing a userspace target I decided to have a go at the Linux Kernel netlink machinery.
Netlink is an internal Linux facility used by tools like "ss", "ip", "netstat". It's used for low level networking tasks - configuring network interfaces, IP addresses, routing tables and such. It's a good target: it's an obscure part of the kernel, and it's relatively easy to automatically craft valid messages. Most importantly, we can learn a lot about Linux internals in the process. Bugs in netlink aren't going Continue reading
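To illustrate why well-formed netlink messages are easy to craft automatically, here is a small sketch in Python that packs one by hand. The 16-byte header layout (length, type, flags, sequence number, port ID) and the constant values follow the Linux `struct nlmsghdr` and rtnetlink definitions. The helper name `build_nlmsg` and the choice of an `RTM_GETLINK` dump request are assumptions made for illustration, not something taken from the post.

```python
import struct

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
# struct ifinfomsg: family, pad, type, index, flags, change
IFINFOMSG = struct.Struct("=BxHiII")

RTM_GETLINK = 18      # rtnetlink: request link (interface) information
NLM_F_REQUEST = 0x01  # this message is a request
NLM_F_DUMP = 0x300    # return all matching entries


def build_nlmsg(msg_type: int, flags: int, payload: bytes, seq: int = 1) -> bytes:
    """Hypothetical helper: prepend a netlink header to `payload`.

    The length field covers header plus payload; the message is then
    padded with zero bytes to a 4-byte boundary.
    """
    length = NLMSG_HDR.size + len(payload)
    header = NLMSG_HDR.pack(length, msg_type, flags, seq, 0)
    padding = b"\x00" * (-length % 4)
    return header + payload + padding


# A request asking the kernel to list all network interfaces.
request = build_nlmsg(
    RTM_GETLINK,
    NLM_F_REQUEST | NLM_F_DUMP,
    IFINFOMSG.pack(0, 0, 0, 0, 0),  # family 0 (AF_UNSPEC), all else zero
)

if __name__ == "__main__":
    print(len(request), request.hex())
```

The sketch only builds the bytes, so it has no side effects. On Linux, a message like this could be sent over a socket created with `socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)`. Because the format is a fixed header followed by structured fields, a fuzzer can keep the header valid and mutate only the payload.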
It’s a crazy idea to think that a network built to be completely decentralized and resilient can be so easily knocked offline in a matter of minutes. But that basically happened twice in the past couple of weeks. Cloudflare is a service provider that offers to sit in front of your website and provide all kinds of important services. They can prevent smaller sites from being knocked offline by an influx of traffic. They can provide security and DNS services for you. They’re quickly becoming an indispensable part of the way the Internet functions. And what happens when we all start to rely on one service too much?
The first outage on June 24, 2019 wasn’t the fault of Cloudflare. A small service provider in Pennsylvania decided to use a BGP Optimizer from Noction to do some route optimization inside their autonomous system (AS). That in and of itself shouldn’t have caused a problem. At least, not until someone leaked those routes to the greater internet.
It was a comedy of errors. The provider in question announced their more specific routes to an upstream customer, who in turn announced them to Verizon. After that all bets are Continue reading
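Why more specific routes matter so much here can be shown with a short Python sketch. Routers forward traffic along the longest matching prefix, so a leaked more specific announcement is preferred over the legitimate covering route. The prefixes below are documentation addresses and the labels are invented for illustration; they are not the routes involved in the incident, and `best_route` is a hypothetical helper.

```python
import ipaddress


def best_route(table, destination):
    """Return the (network, label) entry with the longest matching prefix.

    `table` is a list of (ip_network, label) pairs. Returns None when no
    entry covers the destination.
    """
    addr = ipaddress.ip_address(destination)
    candidates = [(net, label) for net, label in table if addr in net]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry[0].prefixlen)


table = [
    (ipaddress.ip_network("198.51.100.0/24"), "legitimate origin"),
    (ipaddress.ip_network("198.51.100.0/25"), "leaked more specific"),
]

if __name__ == "__main__":
    for dest in ("198.51.100.10", "198.51.100.200"):
        print(dest, "->", best_route(table, dest))
```

Addresses inside the /25 follow the leaked route even though the legitimate /24 is still present, while addresses outside it are unaffected. This is why a leak of more specific routes can pull traffic away from its proper path without the original announcement ever being withdrawn.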
This is a short placeholder blog and will be replaced with a full post-mortem and disclosure of what happened today.
For about 30 minutes today, visitors to Cloudflare sites received 502 errors caused by a massive spike in CPU utilization on our network. This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.
This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred. Internal teams are meeting as I write, performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.
Starting at 13:42 UTC today we experienced a global outage across our network that resulted in visitors to Cloudflare-proxied domains being shown 502 errors (“Bad Gateway”). The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules.
The intent of these new rules was to improve the blocking of inline JavaScript that is used in attacks. These rules were Continue reading
Happy Pride from Proudflare, Cloudflare’s LGBTQIA+ employee resource group. We wanted to share some stories from our members this month which highlight both the struggles behind the LGBTQIA+ rights movement and its successes. This first story is from Lesley.
The moment that crystalised the memory of that day…crystal blue afternoon, bright-coloured autumn leaves, borrowed tables, crockery and cutlery, flowers arranged by a cousin, cake baked by a neighbour, music mixed by a friend... our priest/rabbi a close gay friend with neither yarmulke nor collar. The venue, a backyard kitty-corner at the home my wife grew up in. Love and good wishes in abundance from a community that supports us and our union. And in the middle of all that, my wife… turning to me and smiling, grass stains on the bottom of her long cream wedding dress after abandoning her heels and dancing barefoot in the grass. As usual, a microphone in hand, bringing life and laughter to all with her charismatic quips.
This was the fall of 2002 and same-sex marriage was legal in 0 of the 50 United States.
On Monday we wrote about a painful Internet-wide route leak. We wrote that this should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. That blog entry came out around 19:58 UTC, just over seven hours after the route leak finished (which, as we will see below, was around 12:39 UTC). Today we will dive into the archived routing data and analyze it. The code below is written as simple shell commands so that any reader can follow along and, more importantly, do their own investigations on the routing tables.
This was a very public BGP route leak event. It was both reported online via many news outlets and the event’s BGP data was reported via social media as it was happening. Andree Toonk tweeted a quick list of 2,400 ASNs that were affected.
Quick dumps through the data, showing about 2400 ASns (networks) affected. Cloudflare being hit the hardest. Top 20 of affected ASns below pic.twitter.com/9J7uvyasw2
— Andree Toonk (@atoonk) June 24, 2019
The RIPE NCC operates a very useful archive of BGP routing. Continue reading
This blog contains a large number of acronyms and those are explained at the end of Continue reading
On June 6th 2019, Cloudflare hosted its first-ever customer event in a beautiful, green district of Bangalore, India. More than 60 people, including executives, developers, engineers, and even university students, attended the half-day forum.
The forum kicked off with a series of presentations on the current DDoS landscape, cybersecurity trends, serverless computing and Cloudflare’s Workers. Trey Quinn, Cloudflare Global Head of Solution Engineering, gave a brief introduction to the evolution of edge computing.
We also invited business and thought leaders across various industries to share their insights and best practices on cybersecurity and performance strategy. Some of the keynote and panel sessions included live demos from our customers.
At this event, guests gained first-hand knowledge of the latest technology. They also learned some insider tactics that will help them protect their business, accelerate performance and identify quick wins in a complex Internet environment.
To conclude the event, we arranged a dinner so that guests could network and enjoy a cool summer night.
Through this event, Cloudflare has strengthened the connection with the local tech community. The success of the event cannot be separated from the Continue reading
Today, we’re excited to announce our partnerships with Chronicle Security, Datadog, Elastic, Looker, Splunk, and Sumo Logic to make it easy for our customers to analyze Cloudflare logs and metrics using their analytics provider of choice. In a joint effort, we have developed pre-built dashboards that are available as a Cloudflare App in each partner’s platform. These dashboards help customers better understand events and trends from their websites and applications on our network.
Data analytics is a frequent theme in conversations with Cloudflare customers. Our customers want to understand how Cloudflare speeds up their websites and saves them bandwidth, to see their fastest and slowest pages ranked, and to be alerted if they are under attack. While providing insights is a core tenet of Cloudflare's offering, the data analytics market has matured and many of our customers have started using third-party providers to analyze data, including Cloudflare logs and metrics. By aggregating data from multiple applications, infrastructure, and cloud platforms in one dedicated analytics platform, customers can create a single pane of glass and benefit from better end-to-end visibility over their entire stack.
While these analytics Continue reading