Labbing Network Technology Details with netlab

It’s been over four years since I published the last Software Gone Wild episode. In the meantime, I spent most of my time developing an open-source labbing tool, so it should be no surprise that the first post-hiatus episode focused on a netlab use case: how Ethan Banks (of the PacketPushers fame) is using the tool to quickly check the technology details for his N is for Networking podcast.

As expected, our discussion took us all over the place, including (according to Riverside AI):

Why IT-Site1 Can’t Ping OT_Site1R – Show and Tell Time #1

In my earlier blog post, Troubleshooting OT Security: Why IT-Site1 Can’t Ping OT_Site1R, we discovered the reason for this issue. Our “who done it” is simple. For security reasons, we are using Cisco TrustSec to keep them from communicating. Which... Read More ›

The post Why IT-Site1 Can’t Ping OT_Site1R – Show and Tell Time #1 appeared first on Networking with FISH.

Navigating DORA with Calico: Strengthening Kubernetes Operational Resilience in Financial Services

A single cyberattack or system outage can threaten not just one financial institution, but the stability of a vast portion of the entire financial sector. For today’s financial enterprises, securing dynamic infrastructure like Kubernetes is a core operational and regulatory challenge. The solution lies in achieving DORA compliance for Kubernetes, which transforms your cloud-native infrastructure into a resilient, compliant, and secure backbone for critical financial services.

The Challenge DORA Seeks to Solve

Before DORA (Digital Operational Resilience Act), rules for financial companies primarily focused on making sure they had enough financial capital to cover losses. But what if a cyberattack or tech failure brought a large part of the financial system down? Even with plenty of financial capital, a major outage could stop most operations and cause big problems for the whole financial market. DORA steps in to fix this. It’s all about making sure financial firms can withstand, respond to, and recover quickly from cyberattacks and other digital disruptions.

What is DORA?

The Digital Operational Resilience Act (DORA) is a European Union (EU) regulation that came into effect on January 17, 2025 and is designed to strengthen the security of financial entities. It establishes uniform requirements across the financial Continue reading

HW060: CNAE – A New Wired Networking Certification From the CWNP

The CWNP offers vendor-neutral certifications for wireless networking professionals. This summer, the organization rolled out a wired certification, the Certified Network Administrator and Engineer (CNAE). This cert is aimed at wired and wireless network engineers to ensure they have a solid grounding in switching, routing, cabling, and wired protocols. The CWNP says the cert isn’t... Read more »

PP077: News Roundup–Drift Breach Has Long Reach; FCC Investigates Its Own IoT Security Program

Is any publicity good publicity? On today’s News Roundup we talk about how Salesloft, which makes the Drift chat agent that’s been used as a jumping-off point for credential harvesting and data breach attacks against a bunch of big-name companies, is testing that proposition. We also discuss bugs affecting industrial refrigeration controllers, and Microsoft making... Read more »

Creating a Talos Linux Cluster on AWS with Pulumi, 2025 Edition

A little over two years ago, I wrote a post on creating a Talos Linux cluster on AWS using Pulumi. At the time of that post, the Pulumi provider for Talos was still a prerelease version. Since then, the Talos provider has undergone some notable changes necessitating an update to the example code I have on GitHub. For your reading pleasure, therefore, I present you with the 2025 edition of a tutorial for using Pulumi to create a Talos Linux cluster on AWS.

The updated Pulumi code can be found in this GitHub repository. Note that I’ve tagged the original version from the 2023 blog post with the “2023-post” tag, in the event you’d like to see the original code. While I chose to write my Pulumi code in Go, note that Pulumi supports a number of different languages (such as JavaScript/TypeScript, Python, one of the .NET languages, Java, or even YAML). I leave it as an exercise for the reader to re-implement this functionality in a different language. This Pulumi program is based on the Talos documentation for standing up a cluster on AWS.

The Pulumi program has four major sections:

  1. First, it creates the underlying base infrastructure Continue reading

IDC Makes Ebullient AI Spending Forecast Out To 2029

Ever since Nvidia reported its most recent financial results, where company co-founder and chief executive officer Jensen Huang said that there would be somewhere between $3 trillion and $4 trillion in spending on AI between now and the end of the decade, we have been on the prowl for any market research that backs up this claim or is its source.

IDC Makes Ebullient AI Spending Forecast Out To 2029 was written by Timothy Prickett Morgan at The Next Platform.

Bringing Node.js HTTP servers to Cloudflare Workers

We’re making it easier to run your Node.js applications on Cloudflare Workers by adding support for the node:http client and server APIs. This significant addition brings familiar Node.js HTTP interfaces to the edge, enabling you to deploy existing Express.js, Koa, and other Node.js applications globally with zero cold starts, automatic scaling, and significantly lower latency for your users — all without rewriting your codebase. Whether you're looking to migrate legacy applications to a modern serverless platform or build new ones using the APIs you already know, you can now leverage Workers' global network while maintaining your existing development patterns and frameworks.

The Challenge: Node.js-style HTTP in a Serverless Environment

Cloudflare Workers operate in a unique serverless environment where direct TCP connections aren't available. Instead, all networking operations are fully managed by specialized services outside the Workers runtime itself — systems like our Open Egress Router (OER) and Pingora that handle connection pooling, keeping connections warm, managing egress IPs, and all the complex networking details. This means as a developer, you don't need to worry about TLS negotiation, connection management, or network optimization — it's all handled for you automatically.

This fully-managed approach is actually why Continue reading

Measuring Explicit Congestion Notification

I've always thought that the TCP protocol was the heart of the ingenuity that is the Internet. TCP is a flow control protocol that strives to produce outcomes that are both fair and efficient, not only for each TCP session but for the network as well. The essential feedback signal used by TCP is congestion-based packet loss, but this isn't the only approach. Another approach lies in enlisting the assistance of the network's routers and using packet marking instead of loss to signal to the TCP end hosts that network congestion is taking place. This form of signalling, Explicit Congestion Notification (ECN), is explored in this article, as well as the outcomes of a measurement exercise to establish just how widely this alternative form of congestion signalling is used.
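To make the packet-marking idea concrete, here is a minimal Python sketch of the two-bit ECN field that sits in the low-order bits of the IPv4 ToS / IPv6 Traffic Class byte (codepoints as defined in RFC 3168). The names `ECN`, `split_tos`, and `mark_congestion` are illustrative assumptions, not taken from the article or from any measurement tool.

```python
from enum import IntEnum


class ECN(IntEnum):
    """ECN codepoints from RFC 3168 (low two bits of the ToS/Traffic Class byte)."""
    NOT_ECT = 0b00  # transport is not ECN-capable
    ECT_1 = 0b01    # ECN-capable transport, codepoint ECT(1)
    ECT_0 = 0b10    # ECN-capable transport, codepoint ECT(0)
    CE = 0b11       # congestion experienced (set by a router)


def split_tos(tos: int) -> tuple[int, ECN]:
    """Split the ToS/Traffic Class byte into (DSCP, ECN)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS/Traffic Class must fit in one byte")
    return tos >> 2, ECN(tos & 0b11)


def mark_congestion(tos: int) -> int:
    """Hypothetical router behaviour: mark CE instead of dropping the packet.

    Packets that are not ECN-capable are returned unchanged; a real router
    would have to fall back to dropping them to signal congestion.
    """
    dscp, ecn = split_tos(tos)
    if ecn is ECN.NOT_ECT:
        return tos
    return int((dscp << 2) | ECN.CE)
```

For example, `mark_congestion(0x02)` turns an ECT(0) packet into a CE-marked one, while `mark_congestion(0x00)` leaves a non-ECN packet untouched.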

netlab 25.09: IPv6 RA, Link Impairments, and Performance Gains

netlab release 25.09 includes:

  • Link impairment (implemented with Linux netem queuing discipline) defined in lab topology or configured/controlled with the netlab tc command
  • Configurable IPv6 Router Advertisement parameters
  • The files plugin to store the content of short files (including custom configuration templates) directly in the lab topology
  • Support for Nokia SR-OS container (SR-SIM)
  • Support for very large topologies (tested so far: approximately 3000 lab devices)
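For readers unfamiliar with netem, the following Python sketch builds the kind of Linux `tc` command line that applies delay, jitter, and loss to an interface. It is only an illustration of the underlying queuing discipline: it is not netlab's implementation and does not show the syntax of the netlab tc command or of the lab topology attributes. The helper name `netem_command` and its parameters are assumptions made for this example.

```python
from typing import Optional


def netem_command(
    dev: str,
    delay_ms: Optional[int] = None,
    jitter_ms: Optional[int] = None,
    loss_pct: Optional[float] = None,
) -> list[str]:
    """Build an argument list for `tc qdisc replace dev <dev> root netem ...`.

    Hypothetical helper: it only constructs the command and does not run it.
    """
    args = ["tc", "qdisc", "replace", "dev", dev, "root", "netem"]
    if delay_ms is not None:
        args += ["delay", f"{delay_ms}ms"]
        if jitter_ms is not None:
            # netem takes jitter as an optional second value after the delay
            args.append(f"{jitter_ms}ms")
    elif jitter_ms is not None:
        raise ValueError("jitter requires a base delay")
    if loss_pct is not None:
        if not 0 <= loss_pct <= 100:
            raise ValueError("loss must be a percentage between 0 and 100")
        args += ["loss", f"{loss_pct}%"]
    return args
```

Passing the resulting list to `subprocess.run` (as root, on a Linux host) would apply the impairment; `netem_command("eth1", delay_ms=100, jitter_ms=10, loss_pct=1)` corresponds to `tc qdisc replace dev eth1 root netem delay 100ms 10ms loss 1%`.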

But wait, there’s more (as always):

Ultra Ethernet: Fabric Setup

Introduction: Job Environment Initialization

Distributed AI training requires careful setup of both hardware and software resources. In a UET-based system, the environment initialization proceeds through several key phases, each ensuring that GPUs, network interfaces, and processes are correctly configured before training begins:


1. Fabric Endpoint (FEP) Creation

Each GPU process is associated with a logical Fabric Endpoint (FEP) that abstracts the connection to its NIC port. FEPs, together with the connected switch ports, form a Fabric Plane (FP)—an isolated, high-performance data path. The NICs advertise their capabilities via LLDP messages to ensure compatibility and readiness.

2. Vendor UET Provider Publication

Once FEPs are created, they are published to the Vendor UET Provider, which exposes them as Libfabric domains. This step makes the Fabric Addresses (FAs) discoverable, but actual communication objects (endpoints, address vectors) are created later by the application processes. This abstraction ensures consistent interaction with the hardware regardless of vendor-specific implementations.

3. Job Launcher and Environment Variables

When a distributed training job is launched, the job launcher (e.g., Torchrun) sets up environment variables for each process. These include the master rank IP and port, local and global ranks, and the total number of processes.
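As a rough illustration of this step, the Python sketch below reads the variables that torchrun exports for each worker process (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The `RankInfo` dataclass and the `read_rank_info` helper are hypothetical names introduced for this example; they are not part of PyTorch or of any UET software.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RankInfo:
    master_addr: str  # IP address or hostname of the master rank
    master_port: int  # TCP port of the master rank
    rank: int         # global rank of this process
    local_rank: int   # rank of this process on its node
    world_size: int   # total number of processes in the job


def read_rank_info(env: Optional[Mapping[str, str]] = None) -> RankInfo:
    """Parse the launcher-provided variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    info = RankInfo(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )
    if not 0 <= info.rank < info.world_size:
        raise ValueError("RANK must be in the range [0, WORLD_SIZE)")
    return info
```

Accepting an explicit mapping keeps the helper easy to exercise outside a real job, for instance by passing a plain dictionary instead of the process environment.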

4. Environment Variable Continue reading

Ideal programming language

My last post about Go got some attention.

In fact, two of my posts got attention that day, which broke my nginx: I was running livecount behind nginx, and it ran out of file descriptors when thousands of people had the page open.

It’s a shame that I had to turn off livecount, since it’d be cool to see the stats. But I was out of the country, with unreliable access to both Internet and even electricity in hotels, so I couldn’t implement the real fix until I got back, when it had already mostly died down.


I knew this was a problem with livecount, of course, and I even allude to it in its blog post.

Anyway, back to programming languages.

The reactions to my post can be summarized as:

  • Oh yes, these things are definite flaws in the language.
  • What you’re saying is true, but it’s not a problem. Your post is pointless.
  • You’re dumb. You don’t understand Go. Here let me explain your own blog post to you […]

I respect the first two. The last one has to be from people who are too emotionally invested in their tools, and take articles like this Continue reading

Linux For Network Engineers (LFNE) – AlmaLinux & Alpine Editions

After the release of the Ubuntu 24.04 edition of Linux For Network Engineers (LFNE), I got some questions from the community. Here are two new flavors of LFNE based on your requests. LFNE AlmaLinux 10 OS: for Red Hat fans who prefer a RHEL-style environment. Since CentOS is no longer maintained, AlmaLinux is the closest […]

The post Linux For Network Engineers (LFNE) – AlmaLinux & Alpine Editions first appeared on IPNET.

Broadcom Lands Shepherding Deal For OpenAI “Titan” XPU

Broadcom turned in its financial results for its third quarter last night, and all of the tongues in the IT sector are wagging about how the chip maker and enterprise software giant has landed a fourth customer for its burgeoning custom XPU design and shepherding business.

Broadcom Lands Shepherding Deal For OpenAI “Titan” XPU was written by Timothy Prickett Morgan at The Next Platform.

HPE Systems Rebound As Juniper Brings A Further Boost

Sales of GPU-accelerated servers are still hurting margins at Hewlett Packard Enterprise, as they are doing at all OEMs and probably the ODMs, too, but the good news is that they will be hurting less and less as sales of beefier and more profitable general purpose servers are on the rise and as sovereign clouds and neoclouds turn to HPE for iron and pay higher unit prices for gear.

HPE Systems Rebound As Juniper Brings A Further Boost was written by Timothy Prickett Morgan at The Next Platform.