Author Archives: Marek Majkowski

The revenge of the listening sockets

Back in November we wrote a blog post about one latency spike. Today I'd like to share a continuation of that story. As it turns out, the misconfigured rmem setting wasn't the only source of added latency.

It looked like Mr Wolf hadn't finished his job.


After adjusting the previously discussed rmem sysctl we continued monitoring our systems' latency. Among other things, we measured ping times to our edge servers. While the worst case improved and we didn't see 1000ms+ pings anymore, the line still wasn't flat. Here's a graph of ping latency between an idling internal machine and a production server. The test was done within the datacenter; the packets never went to the public internet. The Y axis of the chart shows ping times in milliseconds, the X axis is the time of the measurement. Measurements were taken every second for over 6 hours:

As you can see, most pings finished below 1ms. But out of 21,600 measurements, about 20 had high latency of up to 100ms. Not ideal, is it?
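If you want to run a similar measurement yourself, a minimal sketch is below. It shells out to the standard ping utility once a second and logs slow or lost round trips; the target address and the 1ms threshold are illustrative, not the values we used:

    import re
    import subprocess
    import time

    TARGET = "203.0.113.1"    # hypothetical address of an edge server
    THRESHOLD_MS = 1.0        # flag anything slower than this

    while True:
        # One ICMP echo request with a 1 second timeout (Linux iputils ping).
        out = subprocess.run(["ping", "-c", "1", "-W", "1", TARGET],
                             capture_output=True, text=True).stdout
        match = re.search(r"time=([\d.]+) ms", out)
        if match is None:
            print(time.strftime("%H:%M:%S"), "lost packet")
        elif float(match.group(1)) > THRESHOLD_MS:
            print(time.strftime("%H:%M:%S"), "slow ping:", match.group(1), "ms")
        time.sleep(1)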

SystemTap

The latency occurred within our datacenter and the packets weren't lost. This suggested a kernel issue again. Linux responds to ICMP pings from its soft… Continue reading

400Gbps: Winter of Whopping Weekend DDoS Attacks

Over the last month, we’ve been watching some of the largest distributed denial of service (DDoS) attacks ever seen unfold. As CloudFlare has grown we've brought online systems capable of absorbing and accurately measuring attacks. Since we don't need to resort to crude techniques to block traffic, we can measure and filter attacks with accuracy. Our systems sort bad packets from good, keep websites online and keep track of attack packet rates and bits per second.

The current spate of large attacks is all layer 3 (L3) DDoS. A layer 3 attack consists of a large volume of packets hitting the target network, and the aim is usually to overwhelm the target's network hardware or connectivity.

L3 attacks are dangerous because most of the time the only solution is to acquire large network capacity and buy beefy networking hardware, which is simply not an option for most independent website operators. Or, faced with huge packet rates, some providers simply turn off connections or entirely block IP addresses.

A Typical Day At CloudFlare

Historically, L3 attacks were the biggest headache for CloudFlare. Over the last two years, we’ve automated almost all of our L3 attack handling and these automatic systems protect… Continue reading

We’re hosting a Null Singapore meetup!

We're happy to announce that next week CloudFlare is hosting the Null Security meetup in Singapore. You are invited!

Null is a community for hackers and security enthusiasts. Monthly meetups are organized in a number of Asian cities. Read more at http://null.co.in/.

The lineup for the February meetup:

  • All you ever wanted to know about DDoS attacks, by Marek Majkowski
  • Security News Bytes, by Drupan Chandarana
  • DNS Hijacking, by Michael Smith

If you’d like to sign up for the event, here are the details:

What: Null Singapore - The Open Security Community meetup

When: February 24th, 6:45pm-8:45pm

Where: The Working Capitol, "The Commons" Room, 1 Keong Saik Road, Singapore 089109

Registration is required.

CloudFlare is actively hiring in Singapore!

The story of one latency spike

A customer reported an unusual problem with our CloudFlare CDN: our servers were responding to some HTTP requests slowly. Extremely slowly. 30 seconds slowly. This happened very rarely and wasn't easily reproducible. To make things worse, all our usual monitoring hadn't caught the problem. At the application layer everything was fine: our NGINX servers were not reporting any long running requests.

Time to send in The Wolf.

He solves problems.

Following the evidence

First, we attempted to reproduce what the customer reported—long HTTP responses. Here is a chart of test HTTP request times measured against our CDN:

We ran thousands of HTTP queries against one server over a couple of hours. Almost all the requests finished in milliseconds, but, as you can clearly see, 5 requests out of thousands took as long as 1000ms to finish. When debugging network problems, delays of 1s or 30s are very characteristic. They may indicate packet loss, since SYN packets are usually retransmitted at times 1s, 3s, 7s, 15s and 31s.
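For the curious, the measurement itself needs nothing fancy. Here is a rough sketch of such a probe, assuming a made-up URL and an arbitrary 500ms outlier threshold; our real harness was more involved:

    import time
    import urllib.request

    URL = "http://example.com/"   # made-up test URL
    SLOW_MS = 500                 # arbitrary outlier threshold

    for i in range(10000):
        start = time.monotonic()
        try:
            urllib.request.urlopen(URL, timeout=35).read()
        except OSError as exc:
            print(f"request {i}: failed: {exc}")
            continue
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > SLOW_MS:
            # Roughly +1000ms extra often means a single retransmitted SYN.
            print(f"request {i}: {elapsed_ms:.0f} ms")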

Blame the network

At first we thought the spikes in HTTP load times might indicate some sort of network problem. To be sure, we ran ICMP pings against two IPs over many… Continue reading

Mobile Ad Networks as DDoS Vectors: A Case Study

CloudFlare servers are constantly being targeted by DDoS attacks. We see everything from attempted DNS reflection attacks to L7 HTTP floods involving large botnets.

Recently an unusual flood caught our attention. A site reliability engineer on call noticed a large number of HTTP requests being issued against one of our customers.

The request

Here is one of the requests:

POST /js/404.js HTTP/1.1  
Host: www.victim.com  
Connection: keep-alive  
Content-Length: 426  
Origin: http://attacksite.com  
User-Agent: Mozilla/5.0 (Linux; U; Android 4.4.4; zh-cn; MI 4LTE Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.0.0 Mobile Safari/537.36 XiaoMi/MiuiBrowser/2.1.1  
Content-Type: application/x-www-form-urlencoded  
Accept: */*  
Referer: http://attacksite.com/html/part/86.html  
Accept-Encoding: gzip, deflate  
Accept-Language: zh-CN,en-US;q=0.8

id=datadatadasssssssssssssssssssssssssssssssssssssssssssassssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssadatadata  

We received millions of similar requests, clearly suggesting a flood. Let's take a deeper look at this request.

First, let's note that the headers look legitimate. We often see floods issued by Python or Ruby scripts, with weird Accept-Language or User-Agent headers. But this one doesn't look like that: it's a proper request issued by a real browser.

Next, notice the request is a POST and contains an Origin header — it was issued by an Ajax (XHR) cross… Continue reading
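To make the fingerprint concrete, here is a simplified sketch, not our actual mitigation code, of how one might flag this pattern: a POST whose Origin header names a different site than the Host it hits:

    def is_cross_origin_post(headers):
        """Flag cross-origin Ajax POSTs like the flood shown above.

        `headers` maps lower-cased header names to values. Simplified
        sketch: real mitigation also looks at rates, paths and payloads.
        """
        origin = headers.get("origin", "")
        host = headers.get("host", "")
        # A present Origin that doesn't name the Host means a cross-origin XHR.
        return bool(origin) and not origin.endswith("//" + host)

    sample = {
        "host": "www.victim.com",
        "origin": "http://attacksite.com",
        "content-type": "application/x-www-form-urlencoded",
    }
    print(is_cross_origin_post(sample))   # True: Origin and Host disagree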

Kernel bypass

In two previous posts we've discussed how to receive 1M UDP packets per second and how to reduce the round trip time. We did the experiments on Linux and the performance was very good considering it's a general purpose operating system.

Unfortunately the speed of vanilla Linux kernel networking is not sufficient for more specialized workloads. For example, here at CloudFlare, we are constantly dealing with large packet floods. Vanilla Linux can do only about 1M pps. This is not enough in our environment, especially since the network cards are capable of handling much higher throughput. Modern 10Gbps NICs can usually process at least 10M pps.
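If you'd like a feel for packet rates on your own machine, the toy below counts every frame delivered to a raw AF_PACKET socket (Linux-only, needs root). Python's own overhead will saturate long before the kernel's, so treat it as an illustration rather than a benchmark:

    import socket
    import time

    ETH_P_ALL = 0x0003   # from linux/if_ether.h: capture every protocol

    # A raw AF_PACKET socket sees every frame the kernel accepts (needs root).
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))

    count = 0
    deadline = time.monotonic() + 1.0
    while True:
        sock.recv(2048)
        count += 1
        if time.monotonic() >= deadline:
            print(f"{count:,} packets/s")
            count = 0
            deadline = time.monotonic() + 1.0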

CC BY 2.0 image by Tony Webster

It's apparent that the only way to squeeze more packets from our hardware is by working around the Linux kernel networking stack. This is called a "kernel bypass" and in this article we'll dig into various ways of achieving it.

The kernel is insufficient

Let's prepare a small experiment to convince you that working around Linux is indeed necessary. Let's see how many packets can be handled by the kernel under perfect conditions. Passing packets to userspace is costly, so instead let's try to drop… Continue reading

How to achieve low latency with 10Gbps Ethernet

Good morning!

In a recent blog post we explained how to tweak a simple UDP application to maximize throughput. This time we are going to optimize our UDP application for latency. Fighting with latency is a great excuse to discuss modern features of multiqueue NICs. Some of the techniques covered here are also discussed in the scaling.txt kernel document.

CC BY-SA 2.0 image by Xiaojun Deng

Our experiment will be set up as follows:

  • We will have two physical Linux hosts: the 'client' and the 'server'. They communicate with a simple UDP echo protocol.
  • The client sends a small UDP frame (32 bytes of payload) and waits for the reply, measuring the round trip time (RTT). The server echoes the packets back immediately after they are received. (A minimal sketch of such an echo pair follows this list.)
  • Both hosts have 2GHz Xeon CPUs, with two sockets of 6 cores and Hyper Threading (HT) enabled - so 24 CPUs per host.
  • The client has a Solarflare 10Gb NIC, the server has an Intel 82599 10Gb NIC. Both cards have fiber connected to a 10Gb switch.
  • We're going to measure the round trip time. Since the numbers are pretty small, there is a lot of jitter when counting the averages. Instead, it… Continue reading
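Here is that sketch: a toy UDP echo server and an RTT-measuring client in one process, using localhost and an arbitrary port instead of two physical hosts:

    import socket
    import threading
    import time

    ADDR = ("127.0.0.1", 4321)   # illustrative; the real test used two hosts

    def echo_server():
        srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        srv.bind(ADDR)
        while True:
            data, peer = srv.recvfrom(2048)
            srv.sendto(data, peer)   # echo back immediately

    threading.Thread(target=echo_server, daemon=True).start()
    time.sleep(0.1)   # give the server a moment to bind

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * 32   # 32 bytes of UDP payload, as in the experiment
    for _ in range(10):
        start = time.monotonic()
        client.sendto(payload, ADDR)
        client.recvfrom(2048)
        print(f"RTT: {(time.monotonic() - start) * 1e6:.0f} us")

Loopback RTTs will be far below anything a real NIC and switch can deliver, but the protocol is exactly the one described above.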

How to receive a million packets per second

Last week during a casual conversation I overheard a colleague saying: "The Linux network stack is slow! You can't expect it to do more than 50 thousand packets per second per core!"

That got me thinking. While I agree that 50kpps per core is probably the limit for any practical application, what is the Linux networking stack capable of? Let's rephrase that to make it more fun:

On Linux, how hard is it to write a program that receives 1 million UDP packets per second?

Hopefully, answering this question will be a good lesson about the design of a modern networking stack.

CC BY-SA 2.0 image by Bob McCaffrey

First, let us assume:

  • Measuring packets per second (pps) is much more interesting than measuring bytes per second (Bps). You can achieve high Bps by better pipelining and sending longer packets. Improving pps is much harder.

  • Since we're interested in pps, our experiments will use short UDP messages. To be precise: 32 bytes of UDP payload. That means 74 bytes on the Ethernet layer.

  • For the experiments we will use two physical servers: "receiver" and "sender". (A toy receiver is sketched after this list.)

  • They both have two six-core 2GHz Xeon processors. With hyperthreading (HT) enabled… Continue reading
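As referenced in the list, here is a toy receiver to start from: a single UDP socket counting packets per second. SO_REUSEPORT, which lets several processes share one port so the kernel can spread a flood across them, is one technique the full post explores; the port number here is arbitrary:

    import socket
    import time

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be set before bind so additional processes can share the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", 4321))   # arbitrary port for the example

    count = 0
    deadline = time.monotonic() + 1.0
    while True:
        sock.recv(2048)   # our 32-byte payloads fit easily
        count += 1
        if time.monotonic() >= deadline:
            print(f"{count:,} packets/s")
            count = 0
            deadline = time.monotonic() + 1.0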

CloudFlare “Interview Questions”

For quite some time we've been grilling our candidates about the dirty corners of the TCP/IP stack. Every engineer here must prove his/her comprehensive understanding of the full network stack. For example: what are the differences in checksumming algorithms between the IPv4 and IPv6 stacks?

I'm joking of course, but in the spirit of the old TCP/IP pub game I want to share some of the amusing TCP/IP quirks I've bumped into over the last few months while working on CloudFlare's automatic attack mitigation systems.

CC BY-SA 2.0 image by Daan Berg

Don't worry if you don't know the correct answer: you may always come up with a funny one!

Some of the questions are fairly obvious, some don't have a direct answer and are supposed to provoke a longer discussion. The goal is to encourage our readers to review the dusty RFCs, get interested in the inner workings of the network stack and generally spread the knowledge about the protocols we rely on so much.

Don't forget to add a comment below if you want to share a response!

You think you know all about TCP/IP? Let's find out.

Archaeology

1) What is the lowest TCP port number?

2) The TCP… Continue reading

Deprecating the DNS ANY meta-query type

DNS, one of the oldest technologies running the Internet, keeps evolving. There is a constant stream of new developments, from DNSSEC, through DNS-over-TLS, to a plentiful supply of fresh EDNS extensions.

CC BY-ND 2.0 image by Antarctica Bound

New DNS Resource Record types are being added all the time. As the Internet evolves, new RRs gain traction while the usage of some old record types decreases. Did you know you can use DNS to express the location of your server on the planet's surface?

Today, we are announcing that we are deprecating the DNS ANY meta-query. In a few weeks we'll be responding to those queries with rcode 4 / Not Implemented.

“ANY” is one of the special “magic” types in DNS. Instead of being a query for a single type like A, AAAA or MX, ANY retrieves all the available types for a given name. Over the years there have been many arguments over the semantics of ANY, with some people arguing it really means ALL. Answers to ANY queries are among the biggest that DNS servers give out. The original reason for adding ANY to DNS was to aid in debugging and testing… Continue reading
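If you'd like to see the new behaviour for yourself, the sketch below sends an ANY query and checks for rcode 4. It assumes the third-party dnspython package, and the resolver address and query name are just examples:

    import dns.message   # third-party: pip install dnspython
    import dns.query
    import dns.rcode

    # Build an ANY query by hand; the resolver address is illustrative.
    query = dns.message.make_query("example.com", "ANY")
    response = dns.query.udp(query, "8.8.8.8", timeout=2)

    if response.rcode() == dns.rcode.NOTIMP:   # rcode 4, as described above
        print("server does not implement ANY")
    else:
        for rrset in response.answer:
            print(rrset)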

Path MTU discovery in practice

Last week, a very small number of our users who are using IP tunnels (primarily tunneling IPv6 over IPv4) were unable to access our services because a networking change broke "path MTU discovery" on our servers. In this article, I'll explain what path MTU discovery is, how we broke it, how we fixed it and the open source code we used.


First there was the fragmentation

When a host on the Internet wants to send some data, it must know how to divide the data into packets. In particular, it needs to know the maximum size of a packet it can send. This maximum is called the Maximum Transmission Unit: MTU.

The larger the MTU, the better for performance, but the worse for reliability, because a lost packet means more data to be retransmitted and because many routers on the Internet can't deliver very large packets.

The fathers of the Internet assumed that this problem would be solved at the IP layer with IP fragmentation. Unfortunately IP fragmentation has serious disadvantages and it's avoided in practice.

Do-not-fragment bit

To work around fragmentation problems, the IP layer puts a "Don't Fragment" bit on every IP packet. … Continue reading
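On Linux you can ask the kernel to set the DF bit and then read back the path MTU it has learned for a destination. Here is a minimal sketch of that, assuming a made-up destination; the numeric option values are the Linux ones from linux/in.h, spelled out in case your Python socket module doesn't export them:

    import socket

    # Linux socket option values from <linux/in.h>, defined manually in
    # case the socket module doesn't export them on your build.
    IP_MTU_DISCOVER = 10
    IP_PMTUDISC_DO = 2     # always set DF; never fragment locally
    IP_MTU = 14            # read back the path MTU known for this socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask the kernel to set the Don't Fragment bit on everything we send.
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    sock.connect(("192.0.2.1", 9))   # hypothetical destination, discard port

    try:
        # An oversized datagram with DF set fails with EMSGSIZE instead of
        # being fragmented, once the kernel knows the packet can't fit.
        sock.send(b"x" * 2000)
    except OSError as exc:
        print("send failed:", exc)

    print("path MTU cached by kernel:",
          sock.getsockopt(socket.IPPROTO_IP, IP_MTU))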