We extensively monitor our network and use multiple systems that give us visibility including external monitoring and internal alerts when things go wrong. One of the most useful systems is Grafana that allows us to quickly create arbitrary dashboards. And a heavy user of Grafana we are: at last count we had 645 different Grafana dashboards configured in our system!
grafana=> select count(1) from dashboard;
count
-------
645
(1 row)
This post is not about our Grafana systems though. It's about something we noticed a few days ago, while looking at one of those dashboards. We noticed this:
This chart shows the number of HTTP requests per second handled by our systems globally. You can clearly see multiple spikes, and this chart most definitely should not look like a porcupine! The spikes were large in scale - 500k to 1M HTTP requests per second. Something very strange was going on.
Our intuition indicated an attack - but our attack mitigation systems didn't confirm it. We'd seen no major HTTP attacks at those times.
It would be bad if we were under such heavy HTTP attack and our mitigation systems didn't notice it. Without more ideas, we Continue reading
At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom RRDNS software, a number went negative when it should always have been, at worst, zero. A little later this negative value caused RRDNS to panic. This panic was caught using the recover feature of the Go language. The net effect was that some DNS resolutions to some Cloudflare managed web properties failed.
The problem only affected customers who use CNAME DNS records with Cloudflare, and only affected a small number of machines across Cloudflare's 102 PoPs. At peak approximately 0.2% of DNS queries to Cloudflare were affected and less than 1% of all HTTP requests to Cloudflare encountered an error.
This problem was quickly identified. The most affected machines were patched in 90 minutes and the fix was rolled out worldwide by 0645 UTC. We are sorry that our customers were affected, but we thought it was worth writing up the root cause for others to understand.
Cloudflare customers use our DNS service to serve the authoritative answers for DNS queries for their domains. They need to tell us the IP address of their origin web servers so we can contact the Continue reading
A man, a plan, a canal, a data center. Over 5 million Internet properties are now faster across Panama, as Cloudflare turned up its newest data center in Panama City. This is our 102nd data center globally, and brings us to a special milestone as our network now spans 50 countries. While perhaps not quite as big an announcement as the $5B Panama Canal expansion, the websites of many important newspapers, TV stations, banks and airlines can be accessed directly from Panama.
‘A man a plan a canal’ y un data center! A Partir de hoy más de 5 millones de sitios en internet serán mas rápidos desde Panamá, ya que hemos introducido nuestro más reciente centro de datos en Ciudad de Panamá. Es nuestro centro de datos número 102 a nivel mundial. Es una hito especial ya que la red global de Cloudflare alcanza 50 países a nivel mundial. Tal vez no es un anuncio tan grande como la inauguración de la expansión del canal, pero a partir de hoy muchos de los sitios importantes de diarios, Televisoras, bancos y aerolíneas serán servidos directamente desde Panamá.
Bridge of the Americas
Puente de las Américas
An abbreviated version of this post originally appeared on TechCrunch
Looking back over 2016, we saw the good and bad that comes with widespread use and abuse of the Internet.
In both Gabon and Gambia, Internet connectivity was disrupted during elections. The contested election in Gambia started with an Internet blackout that lasted a short time. In Gabon, the Internet shutdown lasted for days. Even as we write this countries like DR Congo are discussing blocking specific Internet services, clearly forgetting the lessons learned in these other countries.
CC BY 2.0 image by Aniket Thakur
DDoS attacks continued throughout the year, hitting websites big and small. Back in March, we wrote about 400 Gbps attacks that were happening over the weekend, and then in December, it looked like attackers were treating attacks as a job to be performed from 9 to 5.
In addition to real DDoS, there were also empty threats from a group calling itself Armada Collective and demanding Bitcoin for sites and APIs to stay online. Another group popped up to copycat the same modus-operandi.
The Internet of Things became what many had warned it would become: an army of devices used for attacks. A botnet Continue reading
APIs are increasingly becoming the backbone of the modern internet - whether you're ordering food from an app on your phone or browsing a blog using a modern JavaScript framework, chances are those requests are flowing through an API. Given the need for APIs to evolve through refactoring and extension, having great automated tests allows you to develop fast without needing to slow down to run manual tests to work out what’s broken. Additionally, by having tests in place you’re able to firmly identify the requirements that your API should meet, your API tests effectively form a tangible and executable specification. API Testing offers an end-to-end mechanism of testing the behaviour of your API which has advantages in both reliability and also development productivity.
In this post I'll be demonstrating how you can test RESTful APIs in an automated fashion using PHP, by building a testing framework through creative use of two packages - Guzzle and PHPUnit. The resulting tests will be something you can run outside of your API as part of your deployment or CI (Continuous Integration) process.
Guzzle acts as a powerful HTTP client which we can use to simulate HTTP Requests against our API. Though PHPUnit Continue reading
This piece was originally written for the Gopher Academy advent series. We are grateful to them for allowing us to republish it here.
Back when crypto/tls
was slow and net/http
young, the general wisdom was to always put Go servers behind a reverse proxy like NGINX. That's not necessary anymore!
At Cloudflare we recently experimented with exposing pure Go services to the hostile wide area network. With the Go 1.8 release, net/http
and crypto/tls
proved to be stable, performant and flexible.
However, the defaults are tuned for local services. In this articles we'll see how to tune and harden a Go server for Internet exposure.
crypto/tls
You're not running an insecure HTTP server on the Internet in 2016. So you need crypto/tls
. The good news is that it's now really fast (as you've seen in a previous advent article), and its security track record so far is excellent.
The default settings resemble the Intermediate recommended configuration of the Mozilla guidelines. However, you should still set PreferServerCipherSuites
to ensure safer and faster cipher suites are preferred, and CurvePreferences
to avoid unoptimized curves: a client using CurveP384
would cause up to a second of CPU to be consumed on our Continue reading
Cloudflare has an automatic image optimization feature called Polish, available to customers on paid plans. It recompresses images and removes unnecessary data so that they are delivered to browsers more quickly.
Up until now, Polish has not changed image types when optimizing (even if, for example, a PNG might sometimes have been smaller than the equivalent JPEG). But a new feature in Polish allows us to swap out an image for an equivalent image compressed using Google’s WebP format when the browser is capable of handling WebP and delivering that type of image would be quicker.
CC-BY 2.0 image by John Stratford
The main image formats used on the web haven’t changed much since the early days (apart from the SVG vector format, PNG was the last one to establish itself, almost two decades ago).
WebP is a newer image format for the web, proposed by Google. It takes advantage of progress in image compression techniques since formats such as JPEG and PNG were designed. It is often able to compress the images into a significantly smaller amount of data than the older formats.
WebP is versatile and able to replace the three main Continue reading
We use Salt to manage our ever growing global fleet of machines. Salt is great for managing configurations and being the source of truth. We use it for remote command execution and for network automation tasks. It allows us to grow our infrastructure quickly with minimal human intervention.
CC-BY 2.0 image by Kevin Dooley
We got to thinking. Are DNS records not just a piece of the configuration? We concluded that they are and decided to manage our own records from Salt too.
We are strong believers in eating our own dog food, so we make our employees use the next version of our service before rolling it to everyone else. That way if there's a problem visiting one of the 5 million websites that use Cloudflare it'll get spotted quickly internally. This is also why we keep our own DNS records on Cloudflare itself.
Cloudflare has an API that allows you to manage your zones programmatically without ever logging into the dashboard. Until recently, we were using handcrafted scripts to manage our own DNS records via our API. These scripts were in exotic languages like PHP for historical reasons and had interesting behavior that not everybody enjoyed. Continue reading
In 2011 we launched the Cloudflare Apps platform in an article that described Cloudflare as “not ... the sexiest business in the world.” Sexy or not, Cloudflare has since grown from the 3.5 billion pageviews a month we were doing then to over 1.3 trillion per month today. Along the way, we’ve powered more than a million app installations onto our customer’s websites.
For the last 6 years Cloudflare has been focused on building one of the world’s largest networks. The importance of that work has not left as much time as we would have liked to improve our app platform. With just 21 apps, we knew we were not delivering all that our marketplace could offer.
About six months ago, we were introduced to the team at Eager. Eager was building its own app store for installation onto any website. They impressed us with their ability to enable even the most non-technical website owner to install powerful tools to improve their sites through a slick interface. Eager’s platform included the features we wanted in our marketplace, like the ability to preview an app on a user's site before installing it. Even better, Eager had a powerful app Continue reading
Like most of you, I first heard of Cloudflare via this blog. I read about HTTP/2, Railgun, the Hundredth Data Center, and Keyless SSL — but I never thought I would work here. I, along with my co-founder Adam, and our friends and coworkers were hard at work building something very different. We were working on a tool which spent most of its life in the web browser, not on servers all around the world: an app store for your website. Using our tool a website owner could find and install any of over a hundred apps which could help them collect feedback from their visitors, sell products on their site, or even make their site faster.
Our goal was to create a way for every website owner to find and install all of the open-source and SaaS tools technical experts use everyday. As developers ourselves, we wanted to make it possible for a developer in her basement to build the next great tool and get it on a million websites (and make a million dollars) the next day. We didn’t want her to succeed because she had the biggest sales or marketing team, or the most Continue reading
M42 Smart Motorway in the West Midlands, UK; courtesy of Highways England.
The load time of your website not only affects your search engine rankings, but is also correlated to the conversion rate on your site:
Cloudflare is determined to help website administrators boost the performance of their websites. From today, Cloudflare users on our Business plan will gain a previously Enterprise-only Page Rule option, “Bypass Cache on Cookie”. When used in conjunction with a “Cache Everything” Page Rule, this setting allows for websites to cache the HTML of anonymous page visits without affecting dynamic content.
By caching anonymous page views, Cloudflare is able to help ensure that your origin webserver doesn't waste time constantly regenerating pages which change rarely. This ultimately allows us Continue reading
The following blog post describes a debugging adventure on Cloudflare's Mesos-based cluster. This internal cluster is primarily used to process log file information so that Cloudflare customers have analytics, and for our systems that detect and respond to attacks.
The problem encountered didn't have any effect on our customers, but did have engineers scratching their heads...
At some point in one of our cluster we started seeing errors like this (an NXDOMAIN for an existing domain on our internal DNS):
lookup some.existing.internal.host on 10.36.0.9:53: no such host
This seemed very weird, since the domain did indeed exist. It was one of our internal domains! Engineers had mentioned that they'd seen this behaviour, so we decided to investigate deeper. Queries triggering this error were varied and ranged from dynamic SRV records managed by mesos-dns to external domains looked up from inside the cluster.
Our first naive attempt was to run the following in a loop:
while true; do dig some.existing.internal.host > /tmp/dig.txt || break; done
Running this for a while on one server did not reproduce the problem: all the lookups were successful. Then we took our service Continue reading
Recent headline grabbing DDoS attacks provoked heated debates in the DNS community. Everyone has strong opinions on how to harden DNS to avoid downtime in the future. Is it better to use a single DNS provider or multiple? What DNS TTL values are best? Does DNSSEC make you more or less exposed?
CC BY 2.0 image by Leticia Chamorro
These are valid questions worth serious discussion, but tuning your own DNS server settings is not the full story. Together, as a community, we need to harden the DNS protocol itself. We need to prepare it to withstand the toughest DDoS attacks the future will surely bring. In this blog post I'll point out an obscure feature in the core DNS protocol. It is not practical to use this "hidden" feature for DDoS mitigation now, but with a small tweak it could become extremely useful. The feature is currently unused not due to protocol problems - it's unused because of the DNS Top Level Domain (TLD) operators' apathy. If it was working it would reduce DDoS recovery time for the DNS servers under attack.
The feature in question is: DNS TLD glue records. More specifically DNS TLD glue records with Continue reading
On Wednesday afternoon, Cloudflare and other Internet companies noticed that the West African country of The Gambia had dropped off the Internet - the day before the presidential election that was planned to be held there on Thursday, December 1st. This is not unprecedented. The Ugandan government blocked access to Facebook and WhatsApp during its recent election. Internet blocking by governments has also been seen in Gabon. Even Ghana toyed with the idea earlier this year.
Gambia has a population of 1.8 million people, and according to World Internet Stats, Internet penetration is growing fast and is almost 20%. The latest statistics indicate that at least ten percent of Gambians are using Facebook. As shown in the graph below, on Thursday, the Gambian government cut off access to the global Internet and for 39 hours hundreds of thousands of Gambians were unable to use online services on which they rely every day.
All the networks in Gambia disappeared from the global routing tables. This could have been caused by a soft reconfiguration of Internet routers; or by a physical powering down of telecommunications equipment. At this point, we do not know. What we do know is that we Continue reading
Back in March my colleague Marek wrote about a Winter of Whopping Weekend DDoS Attacks where we were seeing 400Gbps attacks occurring mostly at the weekends. We speculated that attackers were busy with something else during the week.
This winter we've seen a new pattern, and attackers aren't taking the week off, but they do seem to be working regular hours.
CC BY 2.0 image by Carol VanHook
On November 23, the day before US Thanksgiving, our systems detected and mitigated an attack that peaked at 172Mpps and 400Gbps. The attack started at 1830 UTC and lasted non-stop for almost exactly 8.5 hours stopping at 0300 UTC. It felt as if an attacker 'worked' a day and then went home.
The very next day the same thing happened again (although the attack started 30 minutes earlier at 1800 UTC).
On the third day the attacker started promptly at 1800 UTC but went home a little early at around 0130 UTC. But they managed to peak the attack over 200Mpps and 480Gbps.
And the attacker just kept this up day after day. Right through Thanksgiving, Black Friday, Cyber Monday and into this week. Night after night attacks were peaking Continue reading
Sometime before midnight Monday (UK local time) a ship dropped its anchor and broke, not one, not two, but three undersea cables serving the island of Jersey in the English Channel. Jersey is part of the Channel Islands along with Guernsey and some smaller islands.
Image courtesy TeleGeography Submarine Cable Map
These things happen and that’s not a good thing. The cut was reported on the venerable BBC news website. For the telecom operators in Jersey (JT Global) this wasn’t good news. However looking at the traffic from Cloudflare’s point of view; we can see that while the cable cut removed the direct path from London to Jersey, it was replaced by the backup path from Paris to Jersey. The move was 100% under the control of the BGP routing protocol. It's a relief that there's a fallback for when these unpredictable events happen.
Here's a look at one network on the island.
The red traffic is being served from our London data center (the normal location for all Jersey traffic) and the blue traffic is coming from our Paris data center. The step could well be caused by either a delayed break in one of the cables or the Continue reading
If you have experienced HTTP/2 for yourself, you are probably aware of the visible performance gains possible with HTTP/2 due to features like stream multiplexing, explicit stream dependencies, and Server Push.
There is however one important feature that is not obvious to the eye. This is the HPACK header compression. Current implementations of Apache and nginx servers, as well edge networks and CDNs using them, do not support the full HPACK implementation. We have, however, implemented the full HPACK in nginx, and upstreamed the part that performs Huffman encoding.
CC BY 2.0 image by Conor Lawless
This blog post gives an overview of the reasons for the development of HPACK, and the hidden bandwidth and latency benefits it brings.
As you probably know, a regular HTTPS connection is in fact an overlay of several connections in the multi-layer model. The most basic connection you usually care about is the TCP connection (the transport layer), on top of that you have the TLS connection (mix of transport/application layers), and finally the HTTP connection (application layer).
In the the days of yore, HTTP compression was performed in the TLS layer, using gzip. Both headers and body were compressed indiscriminately, Continue reading
Documentation for JavaScript projects has traditionally been generated via annotations inserted as code comments. While this gets the job done, it seems far from ideal. In this post, I’ll explore how to use TypeScript to generate documentation from source code alone.
CC BY-SA 2.0 image by David Joyner
TypeScript is JavaScript with optional types. Here’s a simple example:
// Sends some data to some analytics endpoint
function sendAnalyticsJS(data) {
if (typeof data.type !== 'string') {
throw new Error('The `type` property is required')
}
navigator.sendBeacon('/beacon', JSON.stringify(data))
}
// Results in run-time error
// The `type` property is required
sendAnalyticsJS({ foo: 'bar' })
The JavaScript code will result in a run-time error. This is fine if the developer catches it early, but it would be better if the developer were warned as the bug was introduced. Here’s the same code written using TypeScript:
// Describe the shape of the data parameter
interface IAnalyticsData {
// Type is required
type: string
// All other fields are fair game
[propName: string]: string
}
// We don’t particularly need the data.type check here since
// the compiler will stamp out the majority of those cases.
function sendAnalytics(data: IAnalyticsData) {
if Continue reading
It's 2016 and almost every site using Cloudflare (more than 4 million of them) is using IPv6. Because of this, Cloudflare sees significant IPv6 traffic globally where networks have enabled IPv6 to the consumer.
The top IPv6 networks are shown here.
The chart shows the percentage of IPv6 within a specific network vs. the relative bandwidth of that network. We will talk about specific networks below.
IPv6 is faster for two reasons. The first is that many major operating systems and browsers like iOS, MacOS, Chrome and Firefox impose anywhere from a 25ms to 300ms artificial delay on connections made over IPv4. The second is that some mobile networks won’t need to perform extra v4 -> v6 and v6 -> v4 translations to connect visitors to IPv6 enabled sites if the phone is only assigned an IPv6 address. (IPv6-only phones are becoming very common. If you have a phone on T-Mobile, Telstra, SK Telecom, Orange, or EE UK, to name a few, it’s likely you’re v6-only.)
How much faster is IPv6? Our data shows that visitors connecting over IPv6 were able to connect and load pages in 27% less time than visitors connecting Continue reading
Hola Barcelona! The land of Antoni Gaudí, Salvador Dalí, Ferrán Adriá and Lionel Messi is now also home to Cloudflare.
Located alongside the Mediterranean, Barcelona, is the capital of the autonomous community of Catalonia, the second-most populated municipality in Spain and the core of the fifth-most populous urban area in the Europe. Our data center in Barcelona is our 3rd in the Iberian Peninsula following our deployments in Madrid and Lisbon, our 28th in Europe, and 101st globally. This means not only better performance in Catalonia and Spain, but additional redundancy for European data centers. As of this moment, Cloudflare has a point of presence (PoP) in 7 out of Europe's 10 most populous urban areas, and with number 8 coming soon (all roads have been said to lead there…).
CC BY-NC-ND 2.0 image by Luc Mercelis
Cloudflare has connected to the Catalunya Neutral Internet Exchange (“CATNIX”). This raises the number of exchanges that Cloudflare is a participant in to over 150. As we expand our peering, more visitors get served locally. If you wish to connect with us at CATNIX (or any of our other locations) you can find our peering details on our PeeringDB profile.
Only a Continue reading