TNO028: Move From Monitoring to Full Internet Stack Observability: New Strategies for NetOps (Sponsored)

Network monitoring, Internet monitoring, and observability are all key components of NetOps. We speak with sponsor Catchpoint to understand how Catchpoint can help network operators proactively identify and resolve issues before they impact customers. We discuss past and current network monitoring strategies and the challenges that operators face with both on-prem and cloud monitoring, along...

TL013: The Process Communication Model: An Algorithm for Effective Communication

On this episode of Technically Leadership, we’re joined by Aleksandra Lemańska to learn about the Process Communication Model (PCM), a framework for enhancing communication. Alex calls PCM an algorithm for people, and it can be useful for improving interactions with engineers and technical folks operating in high-stress environments. We talk about how PCM works, understanding...

First-party tags in seconds: Cloudflare integrates Google tag gateway for advertisers

If you’re a marketer, advertiser, or a business owner that runs your own website, there’s a good chance you’ve used Google tags in order to collect analytics or measure conversions. A Google tag is a single piece of code you can use across your entire website to send events to multiple destinations like Google Analytics and Google Ads. 

Historically, the common way to deploy a Google tag was to serve the JavaScript payload directly from Google’s domain. This can work quite well, but it can sometimes hurt performance and measurement accuracy. That’s why Google developed a way to deploy a Google tag on your own first-party infrastructure through server-side tagging. However, server-side tagging requires deploying and maintaining a separate server, which adds cost and operational overhead.

That’s why we’re excited to be Google’s launch partner and announce our direct integration of Google tag gateway for advertisers, providing many of the same performance and accuracy benefits of server-side tagging without the overhead of maintaining a separate server.  

Any domain proxied through Cloudflare can now serve your Google tags directly from that domain. This allows you to get better measurement signals for your website and can enhance your...

Supermicro Hiccups On Hopper, Pulls $40 Billion Guidance For Fiscal 2026

Nvidia co-founder and chief executive officer Jensen Huang did not do his OEM and ODM partners, who are the company’s main route to bring the infrastructure underpinning GPU systems to market, any favors when he suggested its “Hopper” GPU platforms would be blown away by their “Blackwell” kickers.

Supermicro Hiccups On Hopper, Pulls $40 Billion Guidance For Fiscal 2026 was written by Timothy Prickett Morgan at The Next Platform.

Forwarding Packets Across a Network

After inspecting the confusing bridging/routing/switching terminology and a brief detour into the control/data plane details, let’s talk about how packets actually move across a network.

As always, things were simpler when networks were implemented with a single cable. In that setup, all nodes were directly reachable, and the only challenge was figuring out the destination node’s address; it didn’t matter whether it was a MAC address, an IP address, or a Fibre Channel address. On a single cable, you could just broadcast, like, “Who has this service?” and someone would reply, “I’m the printer you’re looking for.” That’s how many early non-IP protocols operated.
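The broadcast-and-reply discovery described above can be sketched as a toy simulation. This is only an illustration, not any specific protocol: the `Node` and `SharedCable` classes, the node addresses, and the service names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    address: str  # hypothetical link-layer address
    services: set[str] = field(default_factory=set)

    def on_broadcast(self, service: str) -> str | None:
        # Every node hears the query; only a node offering the service replies.
        return self.address if service in self.services else None


class SharedCable:
    """Toy model of a single cable: every frame reaches every attached node."""

    def __init__(self) -> None:
        self.nodes: list[Node] = []

    def attach(self, node: Node) -> None:
        self.nodes.append(node)

    def who_has(self, service: str) -> list[str]:
        # Deliver the query to all attached nodes and collect the replies.
        replies = (node.on_broadcast(service) for node in self.nodes)
        return [address for address in replies if address is not None]


cable = SharedCable()
cable.attach(Node("node-a", {"file-server"}))
cable.attach(Node("node-b", {"printer"}))
cable.attach(Node("node-c"))

printer_addresses = cable.who_has("printer")
print(printer_addresses)
```

The sketch works only because every node sees every query, which is exactly the property that disappears once the network is more than one cable.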

Resilience in the RPKI

I would like to look at how the operators of the number Resource Public Key Infrastructure (RPKI) have deployed this infrastructure to maximise its availability and performance and to harden it against potential service interruptions. In other words, this is an examination of the resilience of the RPKI.

QUIC restarts, slow problems: udpgrm to the rescue

At Cloudflare, we do everything we can to avoid interruption to our services. We frequently deploy new versions of the code that delivers the services, so we need to be able to restart the server processes to upgrade them without missing a beat. In particular, performing graceful restarts (also known as "zero downtime") for UDP servers has proven to be surprisingly difficult.

We've previously written about graceful restarts in the context of TCP, which is much easier to handle. We didn't have a strong reason to deal with UDP until recently — when protocols like HTTP3/QUIC became critical. This blog post introduces udpgrm, a lightweight daemon that helps us to upgrade UDP servers without dropping a single packet.

Here's the udpgrm GitHub repo.

Historical context

In the early days of the Internet, UDP was used for stateless request/response communication with protocols like DNS or NTP. Restarts of a server process are not a problem in that context, because it does not have to retain state across multiple requests. However, modern protocols like QUIC, WireGuard, and SIP, as well as online games, use stateful flows. So what happens to the state associated with a flow when a server process is...

Screen Scraping in 2025

Dr. Tony Przygienda left a very valid (off-topic) comment to my Breaking APIs or Data Models Is a Cardinal Sin blog post:

If, on the other hand, the customers would not camp for literally tens of years on regex scripts scraping screens, lots of stuff could progress much faster.

He’s right, particularly from Juniper’s perspective; they were the first vendor to use a data-driven approach to show commands. Unfortunately, we’re still not living in a perfect world:

Multi-vendor support for dropped packet notifications


The sFlow Dropped Packet Notification Structures extension was published in October 2020. Extending sFlow to provide visibility into dropped packets offers significant benefits for network troubleshooting, providing real-time, network-wide visibility into the specific packets that were dropped as well as the reason each packet was dropped. This visibility instantly reveals the root cause of drops and the impacted connections. Packet discard records complement sFlow's existing counter polling and packet sampling mechanisms and share a common data model, so all three sources of data can be correlated. For example, packet sampling reveals the top consumers of bandwidth on a link, helping to get to the root cause of congestion-related packet drops reported for the link.
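As a rough sketch of the correlation described above, the following joins drop notifications with packet samples on a shared link identifier. The record layout, field names, addresses, and the `queue_full` reason string are simplified assumptions for illustration; they are not the actual sFlow structures or reason codes.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified records. Real sFlow records carry many more fields.
packet_samples = [
    {"link": "eth1", "src": "10.0.0.1", "dst": "10.0.1.9", "bytes": 1500},
    {"link": "eth1", "src": "10.0.0.1", "dst": "10.0.1.9", "bytes": 1500},
    {"link": "eth1", "src": "10.0.0.2", "dst": "10.0.1.9", "bytes": 400},
    {"link": "eth2", "src": "10.0.0.3", "dst": "10.0.2.5", "bytes": 900},
]
drop_notifications = [
    {"link": "eth1", "src": "10.0.0.2", "dst": "10.0.1.9", "reason": "queue_full"},
    {"link": "eth1", "src": "10.0.0.1", "dst": "10.0.1.9", "reason": "queue_full"},
]


def correlate(samples, drops):
    """For each link with drops, report drop reasons and the top sampled flow."""
    bytes_per_flow = defaultdict(Counter)
    for sample in samples:
        flow = (sample["src"], sample["dst"])
        bytes_per_flow[sample["link"]][flow] += sample["bytes"]

    reasons_per_link = defaultdict(Counter)
    for drop in drops:
        reasons_per_link[drop["link"]][drop["reason"]] += 1

    report = {}
    for link, reasons in reasons_per_link.items():
        top = bytes_per_flow[link].most_common(1)
        report[link] = {
            "drop_reasons": dict(reasons),
            "top_flow": top[0][0] if top else None,
        }
    return report


report = correlate(packet_samples, drop_notifications)
print(report)
```

Keying both data sources on the same link name stands in for the common data model: the drops say where and why packets were lost, and the samples suggest which flow was consuming the link.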

Today the following network operating systems include support for the drop notification extension in their sFlow agent implementations:

Two additional sFlow dropped packet notification implementations are in the pipeline and should be available later this year:

CNCF and Synadia Reach an Agreement on NATS

Last month, Synadia, the primary maintainer of the NATS messaging system, tried to withdraw NATS from the open source governance of the Cloud Native Computing Foundation (CNCF). Its motive was to try to profit from NATS by ... Synadia had previously donated NATS to the CNCF in 2018. Now, the NATS project will continue in the CNCF’s cloud native open source ecosystem with Synadia’s continued support and involvement. A spokesperson for Synadia did not immediately respond to a TNS request for comment.

Not So Fast

Synadia had planned to regain control of the...

Scaling with safety: Cloudflare’s approach to global service health metrics and software releases

Has your browsing experience ever been disrupted by this error page? Sometimes Cloudflare returns "Error 500" when our servers cannot respond to your web request. This inability to respond could have several potential causes, including problems caused by a bug in one of the services that make up Cloudflare's software stack.

We know that our testing platform will inevitably miss some software bugs, so we built guardrails to gradually and safely release new code before a feature reaches all users. Health Mediated Deployments (HMD) is Cloudflare’s data-driven solution to automating software updates across our global network. HMD works by querying Thanos, a system for storing and scaling Prometheus metrics. Prometheus collects detailed data about the performance of our services, and Thanos makes that data accessible across our distributed network. HMD uses these metrics to determine whether new code should continue to roll out, pause for further evaluation, or be automatically reverted to prevent widespread issues.
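The continue, pause, or revert decision can be sketched as a small pure function over an error ratio. This is a minimal sketch under stated assumptions, not Cloudflare's implementation: the `Decision` enum, the function names, and both thresholds are hypothetical, and in practice the inputs would come from metric queries rather than literals.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    REVERT = "revert"


def error_ratio(errors: float, total: float) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / total if total > 0 else 0.0


def evaluate_release(
    ratio: float,
    pause_above: float = 0.001,   # hypothetical threshold
    revert_above: float = 0.01,   # hypothetical threshold
) -> Decision:
    """Map an observed error ratio to a rollout decision."""
    if ratio > revert_above:
        return Decision.REVERT
    if ratio > pause_above:
        return Decision.PAUSE
    return Decision.CONTINUE


# Example: 5 errors out of 1000 requests observed during a rollout stage.
decision = evaluate_release(error_ratio(errors=5, total=1000))
print(decision)
```

Two thresholds give the middle "pause" state: a release that looks slightly unhealthy is held for evaluation instead of being reverted outright.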

Cloudflare engineers configure signals from their service, such as alerting rules or Service Level Objectives (SLOs). For example, the following Service Level Indicator (SLI) checks the rate of HTTP 500 errors over 10 minutes returned from a service in our software stack.

sum(rate(http_request_count{code="500"}[10m]))

IS-IS Behavior on Multi-Access Networks

In many ways, IS-IS is a simpler, and perhaps more elegant, routing protocol than OSPF. However, it is often misunderstood, perhaps due to its roots in OSI or perhaps because it’s not as widely deployed as OSPF. Some of the confusion comes from how it behaves on multi-access networks. What is a Designated Intermediate System (DIS)? Why do we need a pseudonode? How do we flood Link State PDUs (LSPs)? In this post, I’ll cover all of that and more. This is going to be a deep dive, so save this blog for when you have some time to focus.

IS-IS is a link state protocol, which means that we need to build a link state database that describes how all the intermediate systems (routers) are interconnected, and what prefixes they are associated with. Simply put, we need to build a graph. Let’s do a quick recap of graph theory.

A graph consists of vertices (nodes) and edges (links). When referring to a single node, it’s called a vertex. This is shown below:
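A graph of vertices and edges can be sketched with an adjacency map. This is only an illustration: the `Graph` class and the router names R1 to R3 are hypothetical, and the edges are stored as undirected and unweighted for simplicity.

```python
from collections import defaultdict


class Graph:
    """Undirected, unweighted graph stored as an adjacency map."""

    def __init__(self) -> None:
        self.adjacency: dict[str, set[str]] = defaultdict(set)

    def add_edge(self, a: str, b: str) -> None:
        # An undirected edge is recorded in both directions.
        self.adjacency[a].add(b)
        self.adjacency[b].add(a)

    def vertices(self) -> set[str]:
        return set(self.adjacency)

    def edges(self) -> set[frozenset[str]]:
        # frozenset makes {R1, R2} and {R2, R1} the same edge.
        return {
            frozenset((a, b))
            for a, neighbors in self.adjacency.items()
            for b in neighbors
        }


graph = Graph()
graph.add_edge("R1", "R2")
graph.add_edge("R2", "R3")

print(sorted(graph.vertices()))
print(sorted(sorted(edge) for edge in graph.edges()))
```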

There are different types of graphs. They can be undirected and unweighted:

With this type of graph, there is no weight assigned and there is no way...