Daniel Dib asked a sad question on LinkedIn:
Where did all the great documentation go?
In more detail:
There was a time when documentation answered almost all questions:
- What is the thing?
- What does the thing do?
- Why would you use the thing?
- How do you configure the thing?
I’ve seen the same thing happening in training, and here’s my cynical TL&DR answer: because the managers of the documentation/training departments don’t understand the true value of what they’re producing and thus cannot justify a decent budget to make it happen.
A few days ago, I described how you can use the new config.inline functionality to apply additional configuration commands to individual devices in a netlab-powered lab.
However, sometimes you have to apply the same set of commands to several devices. Although you could use device groups to do that, netlab release 25.09 offers a much better mechanism: you can embed custom configuration templates in the lab topology file.
[Updated 12-October, 2025: Figure & uet addressing section]
In libfabric and Ultra Ethernet Transport (UET), the endpoint, represented by the object fid_ep, serves as the primary communication interface between a process and the underlying network fabric. Every data exchange, whether it involves message passing, remote memory access (RMA), or atomic operations, ultimately passes through an endpoint. It acts as a software abstraction of the transport hardware, exposing a programmable interface that the application can use to perform high-performance data transfers.
Conceptually, an endpoint resembles a socket in the TCP/IP world. However, while sockets hide much of the underlying network stack behind a simple API, endpoints expose far more detail and control. They allow the process to define which completion queues to use, what capabilities to enable, and how multiple communication contexts are managed concurrently. This design gives applications, especially large distributed training frameworks and HPC workloads, direct control over latency, throughput, and concurrency in ways that traditional sockets cannot provide.
Furthermore, socket-based communication typically relies on the operating system’s networking stack and consumes CPU cycles for data movement and protocol handling. In contrast, endpoint communication paths can interact directly with the NIC, enabling user-space data transfers Continue reading

Netpicker brings together configuration backups, security and compliance checks, and automation in one place. It supports over 150 network vendors, including Cisco, Juniper, Arista, Palo Alto, Fortinet and many more. It also integrates well with other tools like Netbox, Nautobot, and Infrahub for inventory management, and Slurp’it for network discovery. All of these features, including backups, integrations, and automation, are available in the free version. The paid version adds features such as RBAC, approvals, detailed logging, support, and workflows for teams that need more control.
In our previous introductory post, we looked at what Netpicker is, how to set it up, and how it can back up configurations across multiple vendors. In this post, we'll focus on Netpicker Automation and how to use the Netpicker plugin with Netbox, so let's get to it.
Netpicker has partnered with me for this post, and they also support my blog as a sponsor.
This post assumes you already have a functioning Netpicker and NetBox instances. If you're completely new to Netpicker, check out the introductory post first, where we covered the basics like installation and initial setup. For this example, we’re using NetBox Community version 4.3. Continue reading
We network engineers often find ourselves without a viable plan–our plans always seem to go awry, to the point that many network engineers just give up on planning. Is “giving up” the right solution? Or can we learn to be better planners? Jonathan Adams and Tim McConnaughy join Russ to discussion planning for network engineers.
netlab release 25.10 includes:
You’ll find more details in the release notes.
Cloudforce One’s mission is to help defend the Internet. In Q2’25 alone, Cloudflare stopped an average of 190 billion cyber threats every single day. But real-world customer experiences showed us that stopping attacks at the edge isn’t always enough. We saw ransomware disrupt financial operations, data breaches cripple real estate firms, and misconfigurations cause major data losses.
In each case, the real damage occurred inside networks.
These internal breaches uncovered another problem: customers had to hand off incidents to separate internal teams for investigation and remediation. Those handoffs created delays and fractured the response. The result was a gap that attackers could exploit. Critical context collected at the edge didn’t reach the teams managing cleanup, and valuable time was lost. Closing this gap has become essential, and we recognized the need to take responsibility for providing customers with a more unified defense.
Today, Cloudforce One is launching a new suite of incident response and security services to help organizations prepare for and respond to breaches.
These services are delivered by Cloudforce One REACT (Respond, Evaluate, Assess, Consult Team), a group of seasoned responders and security veterans who investigate threats, hunt adversaries, and work closely with executive leadership to guide Continue reading
An engineer reading my multi-pod EVPN article asked an interesting question:
How do you handle troubleshooting when VTEPs cannot reach each other across pods?
The ancient Romans already knew the rough answer: divide and conquer.
In this particular case, the “divide” part starts with a simple realization: VXLAN/EVPN is just another application running on top of IP.
If you’re running Calico using manifests, you may have found that enabling the observability features introduced in version 3.30, including Whisker and Goldmane, requires a more hands-on approach. Earlier documentation focused on the Tigera operator, which automates key tasks such as certificate management and secure service configuration. In a manifest-based setup, these responsibilities shift to the user. While the process involves more manual steps, it provides greater transparency and control over each component. With the right guidance, setting up these observability tools is entirely achievable and offers valuable insight into the internal behavior of your Calico deployment.
We’ve heard from many of you in the Calico Slack community: you’re eager to try out Whisker and Goldmane but aren’t sure how to set them up without Helm or the operator. For anyone who’s up for a challenge, this blog post provides a step-by-step guide on how to get everything wired up the hard way.
However, even if you already use the operator, keep reading! We’re going to pull back the curtain on the magic it performs behind the scenes. Understanding these mechanics will help you troubleshoot, customize, and better appreciate a managed Continue reading
Every second, 84 million HTTP requests are hitting Cloudflare across our fleet of data centers in 330 cities. It means that even the rarest of bugs can show up frequently. In fact, it was our scale that recently led us to discover a bug in Go's arm64 compiler which causes a race condition in the generated code.
This post breaks down how we first encountered the bug, investigated it, and ultimately drove to the root cause.
We run a service in our network which configures the kernel to handle traffic for some products like Magic Transit and Magic WAN. Our monitoring watches this closely, and it started to observe very sporadic panics on arm64 machines.
We first saw one with a fatal error stating that traceback did not unwind completely. That error suggests that invariants were violated when traversing the stack, likely because of stack corruption. After a brief investigation we decided that it was probably rare stack memory corruption. This was a largely idle control plane service where unplanned restarts have negligible impact, and so we felt that following up was not a priority unless it kept happening.
And then it kept happening.