Archive

Category Archives for "Networking"

Using netlab for Classroom Training with Sander Steffann

In March 2024, I received my first PR from an airplane: Sander Steffann was flying to South Africa to deliver an Ansible training and fixed a minor annoyance in the then-new multilab feature.

Of course, I wanted to know more about his setup, but it took us over a year and a half till we managed to sit down (virtually) and chat about it, the state of IPv6, the impact of CG-NAT on fraud prevention, and why digital twins don’t make sense in large datacenter migrations.

For more details, listen to Episode 202 of Software Gone Wild.

Hedge 288: Loneliness

Sometimes we just like to talk about “life issues” as they relate to network engineering and technology career fields. Loneliness seems to especially plague network engineering and other “small” IT fields, where communities are small, change is rapid, and stress is high. Loneliness expert Lucy Rose joins Eyvonne, Tom, and Russ to discuss loneliness.

Check out the Cost of Loneliness here.

download

How We Measure: IPv6

At APNIC Labs we publish a number of measurements of the deployment of various technologies that are being adopted on the Internet. Here I will look at how we measure the adoption of IPv6.

Finding the grain of sand in a heap of Salt

How do you find the root cause of a configuration management failure when you have a peak of hundreds of changes in 15 minutes on thousands of servers?

That was the challenge we faced as we built the infrastructure to reduce release delays due to failures of Salt, a configuration management tool. (We eventually reduced such failures on the edge by over 5%, as we’ll explain below.) We’ll explore the fundamentals of Salt, and how it is used at Cloudflare. We then describe the common failure modes and how they delay our ability to release valuable changes to serve our customers.

By first solving an architectural problem, we provided the foundation for self-service mechanisms to find the root cause of Salt failures on servers, datacenters and groups of datacenters. This system is able to correlate failures with git commits, external service failures and ad hoc releases. The result of this has been a reduction in the duration of software release delays, and an overall reduction in toilsome, repetitive triage for SRE.

To start, we will go into the basics of the Cloudflare network and how Salt operates within it. And then we’ll get to how we solved the challenge Continue reading

netlab 25.11: SRv6 on IOS/XE, Streamlined Graphs and Reports

I managed to push out netlab release 25.11 yesterday. Here are the highlights:

  • SRv6 on IOS/XE. It works with Catalyst 8000v, IOL, and IOL layer-2 image, and can be used to build L3VPNs (the IOS/XE image I have supports no other service on top of SRv6)
  • RIPv2/RIPng on OpenBSD
  • A more streamlined way to create reports and graphs
  • The netlab graph command creates the SVG/PNG/JPEG/PDF graph instead of a graph description file if you’ve installed D2/Graphviz on your system.

We also had to make a few potentially-breaking changes, fixed a bunch of bugs, and added over a dozen small improvements.

You’ll find all the details in the release notes.

NAN106: Unimus: Network Automation By and For Network Engineers (Sponsored)

Tomas Kirnak, CEO of Unimus, joins Eric Chou in this sponsored episode to introduce Unimus, an on-premise network configuration management system built by network engineers to solve real-world problems. In this deep dive they discuss Unimus’ proprietary “Behavioral Tree” for automatic device discovery, the platform’s vendor support, the 70/30 rule, and lowering the barrier for... Read more »

Connecting to production: the architecture of remote bindings

Remote bindings are bindings that connect to a deployed resource on your Cloudflare account instead of a locally simulated resource – and recently, we announced that remote bindings are now generally available

With this launch, you can now connect to deployed resources like R2 buckets and D1 databases while running Worker code on your local machine. This means you can test your local code changes against real data and services, without the overhead of deploying for each iteration. 

In this blog post, we’ll dig into the technical details of how we built it, creating a seamless local development experience.

Developing on the Workers platform

A key part of the Cloudflare Workers platform has been the ability to develop your code locally without having to deploy it every time you wanted to test something – though the way we’ve supported this has changed greatly over the years. 

We started with wrangler dev running in remote mode. This works by deploying and connecting to a preview version of your Worker that runs on Cloudflare’s network every time you make a change to your code, allowing you to test things out as you develop. However, remote mode isn’t perfect — Continue reading

Worth Reading: The Majority AI View

Many engineers who tried out (or use) various AI products would agree that they’re useful when used correctly, but way overhyped. However, as Anil Dash explains in his Majority AI View article, we rarely hear that opinion:

What’s amazing is the reality that virtually 100% of tech experts I talk to in the industry feel this way, yet nobody outside of that cohort will mention this reality.

HS116: Nth-Party Risk May Put You on the (Block) Chain Gang

The evolution of the modern, Internet-driven economy has created the conditions for essentially unbounded Nth-party risks (that is, risks from your suppliers, and risks from your suppliers’ suppliers, and risks from your suppliers’ suppliers’ suppliers, ad infinitum). Nth party risks exist in public clouds, SaaS, software and hardware supply chains, and now in the form... Read more »

One-Arm Hub-and-Spoke VPN on Arista EOS

In September 2024, I described how you can build One-Arm Hub-and-Spoke VPN with MPLS/VPN. In that blog post, I mentioned that the solution doesn’t work on Arista EOS because it allocates MPLS labels to whole VRFs (per-VRF label allocation).

In early September, I received an email from Daniel Blažek telling me that Arista fixed this particular annoyance in the EOS release 4.34.2F. It still uses per-VRF label allocation, but now, you can assign a different label to the default route. Let’s see how that works with our one-arm hub-and-spoke topology:

UET Data Transport Part I: Introduction

[Figure updated 13 November 2025]

My previous UET posts explained how an application uses libfabric function API calls to discover available hardware resources and how this information is used to create a hardware abstraction layer composed of Fabric, Domain, and Endpoint objects, along with their child objects — Event Queues, Completion Queues, Completion Counters, Address Vectors, and Memory Regions.

This chapter explains how these objects are used during data transfer operations. It also describes how information is encoded into UET protocol headers, including the Semantic Sublayer (SES) and Packet Delivery Sublayer (PDC). In addition, the chapter covers how the Congestion Management Sublayer (CMS) monitors and controls send queue rates to prevent egress buffer overflows.

Note: In this book, libfabric API calls are divided into two categories for clarity. Functions are used to create and configure fabric objects such as fabrics, domains, endpoints, and memory regions (for example, fi_fabric(), fi_domain(), and fi_mr_reg()). Operations, on the other hand, perform actual data transfer or synchronization between processes (for example, fi_write(), fi_read(), and fi_send()).

Figure 5-1 provides a high-level overview of a libfabric Remote Memory Access (RMA) operation using the fi_write function call. When an application needs to transfer data, such as gradients, from Continue reading

A closer look at Python Workflows, now in beta

Developers can already use Cloudflare Workflows to build long-running, multi-step applications on Workers. Now, Python Workflows are here, meaning you can use your language of choice to orchestrate multi-step applications.

With Workflows, you can automate a sequence of idempotent steps in your application with built-in error handling and retry behavior. But Workflows were originally supported only in TypeScript. Since Python is the de facto language of choice for data pipelines, artificial intelligence/machine learning, and task automation – all of which heavily rely on orchestration – this created friction for many developers.

Over the years, we’ve been giving developers the tools to build these applications in Python, on Cloudflare. In 2020, we brought Python to Workers via Transcrypt before directly integrating Python into workerd in 2024. Earlier this year, we built support for CPython along with any packages built in Pyodide, like matplotlib and pandas, in Workers. Now, Python Workflows are supported as well, so developers can create robust applications using the language they know best.

Why Python for Workflows?

Imagine you’re training an LLM. You need to label the dataset, feed data, wait for the model to run, evaluate the loss, adjust the model, and repeat. Without automation, Continue reading

Tailscale Welcomes Kubernetes Co-Founder Joe Beda as Advisor

Virtual Private Network (VPN) software provider Tailscale has brought on Kubernetes pioneer Brendan Burns (currently at Microsoft) and open source VPN software WireGuard, which provides an easy way to remotely connect to a network by way of VPN protocols. The company has parlayed the open source success of the code into an enterprise platform for running networks as well. Now, in an effort to expand its reach, Tailscale is looking to break into the cloud native Kubernetes market. The company has kicked off a number of initiatives to support Kubernetes networking in a production-scale facility. “Kubernetes networking has always been a bit of a challenge,” largely owing to its immense flexibility and ability to work in so many different environments, Beda said in an interview with TNS. Setting up the networking for a single cluster is easy enough, he said. But as the Continue reading

netlab: Test IPv6 IGP Deployment

Imagine you have an IPv4-only network1 and want to try out how to deploy a routing protocol for IPv6. netlab is a pretty good tool for the job as it:

  • Creates an addressing scheme for you
  • Designs a routing protocol deployment (OSPF, IS-IS) based on just a few bits of information
  • Deploys ready-to-run router configurations to a virtual lab.

Introducing Calico AI and Istio Ambient Mode

The Complexity of Modern Kubernetes Networking

Kubernetes has transformed how teams build and scale applications, but it has also introduced new layers of complexity. Platform and DevOps teams must now integrate and manage multiple technologies: CNI, ingress and egress gateways, service mesh, and more across increasingly large and dynamic environments. As more applications are deployed into Kubernetes clusters, the operational burden on these teams continues to grow, especially when maintaining performance, reliability, security, and observability across diverse workloads.

To address this complexity and tool sprawl, Tigera is incorporating Istio’s Ambient Service Mesh directly into the Calico Unified Network Security Platform. Service mesh has become the preferred solution for application-level networking, particularly in environments with a large number of services or highly regulated workloads. Among available service meshes, Istio stands out as the most popular and widely adopted, supported by a thriving open-source community. By leveraging the lightweight, sidecarless design of Istio Ambient Mode, Calico delivers all the benefits of service mesh, secure service-to-service communication, mTLS authentication, fine-grained authorization, traffic management, and observability, without the burden of sidecars.

Complementing this addition is Calico AI. Calico AI brings intelligence and automation to Kubernetes networking. It addresses the massive operational burden on teams Continue reading

1 4 5 6 7 8 3,476