Anbang Wen

Author Archives: Anbang Wen

How Rust and Wasm power Cloudflare’s 1.1.1.1

How Rust and Wasm power Cloudflare's 1.1.1.1
How Rust and Wasm power Cloudflare's 1.1.1.1

On April 1, 2018, Cloudflare announced the 1.1.1.1 public DNS resolver. Over the years, we added the debug page for troubleshooting, global cache purge, 0 TTL for zones on Cloudflare, Upstream TLS, and 1.1.1.1 for families to the platform. In this post, we would like to share some behind the scenes details and changes.

When the project started, Knot Resolver was chosen as the DNS resolver. We started building a whole system on top of it, so that it could fit Cloudflare's use case. Having a battle tested DNS recursive resolver, as well as a DNSSEC validator, was fantastic because we could spend our energy elsewhere, instead of worrying about the DNS protocol implementation.

Knot Resolver is quite flexible in terms of its Lua-based plugin system. It allowed us to quickly extend the core functionality to support various product features, like DoH/DoT, logging, BPF-based attack mitigation, cache sharing, and iteration logic override. As the traffic grew, we reached certain limitations.

Lessons we learned

Before going any deeper, let’s first have a bird’s-eye view of a simplified Cloudflare data center setup, which could help us understand what we are going to talk Continue reading

Unwrap the SERVFAIL

Unwrap the SERVFAIL

We recently released a new version of Cloudflare Resolver which adds a piece of information called “Extended DNS Errors” (EDE) along with the response code under certain circumstances. This will be helpful in tracing DNS resolution errors and figuring out what went wrong behind the scenes.

Unwrap the SERVFAIL
(image from: https://www.pxfuel.com/en/free-photo-expka)

A tight-lipped agent

The DNS protocol was designed to map domain names to IP addresses. To inform the client about the result of the lookup, the protocol has a 4 bit field, called response code/RCODE. The logic to serve a response might look something like this:

function lookup(domain) {
    ...
    switch result {
    case "No error condition":
        return NOERROR with client expected answer
    case "No record for the request type":
        return NOERROR
    case "The request domain does not exist":
        return NXDOMAIN
    case "Refuse to perform the specified operation for policy reasons":
        return REFUSE
    default("Server failure: unable to process this query due to a problem with the name server"):
        return SERVFAIL
    }
}

try {
    lookup(domain)
} catch {
    return SERVFAIL
}

Although the context hasn't changed much, protocol extensions such as DNSSEC have been added, which makes the RCODE run out of space to express the server's internal Continue reading