Whenever you see Gerhard Stein and Thomas Weible from Flexoptix in a list of presenters, three things immediately become obvious:
Their SwiNOG 40 presentation (video) met all three expectation. We learned how well transceivers cope with high temperatures and what happens when you try to melt them with a heat gun.
A little over two years ago, I wrote a post on creating a Talos Linux cluster on AWS using Pulumi. At the time of that post, the Pulumi provider for Talos was still a prerelease version. Since then, the Talos provider has undergone some notable changes necessitating an update to the example code I have on GitHub. For your reading pleasure, therefore, I present you with the 2025 edition of a tutorial for using Pulumi to create a Talos Linux cluster on AWS.
The updated Pulumi code can be found in this GitHub repository. Note that I’ve tagged the original version from the 2023 blog post with the “2023-post” tag, in the event you’d like to see the original code. While I chose to write my Pulumi code in Go, note that Pulumi supports a number of different languages (such as JavaScript/TypeScript, Python, one of the .NET languages, Java, or even YAML). I leave it as an exercise for the reader to re-implement this functionality in a different language. This Pulumi program is based on the Talos documentation for standing up a cluster on AWS.
The Pulumi program has four major sections:
Ever since Nvidia reported its most recent financial results, where company co-founder and chief executive officer Jensen Huang said that there would be somewhere between $3 trillion and $4 trillion in spending on AI between now and the end of the decade, we have been on the prowl for any market research that backs up this claim or is its source. …
IDC Makes Ebullient AI Spending Forecast Out To 2029 was written by Timothy Prickett Morgan at The Next Platform.
We’re making it easier to run your Node.js applications on Cloudflare Workers by adding support for the node:http
client and server APIs. This significant addition brings familiar Node.js HTTP interfaces to the edge, enabling you to deploy existing Express.js, Koa, and other Node.js applications globally with zero cold starts, automatic scaling, and significantly lower latency for your users — all without rewriting your codebase. Whether you're looking to migrate legacy applications to a modern serverless platform or build new ones using the APIs you already know, you can now leverage Workers' global network while maintaining your existing development patterns and frameworks.
Cloudflare Workers operate in a unique serverless environment where direct tcp connection isn't available. Instead, all networking operations are fully managed by specialized services outside the Workers runtime itself — systems like our Open Egress Router (OER) and Pingora that handle connection pooling, keeping connections warm, managing egress IPs, and all the complex networking details. This means as a developer, you don't need to worry about TLS negotiation, connection management, or network optimization — it's all handled for you automatically.
This fully-managed approach is actually why Continue reading
netlab release 25.09 includes:
But wait, there’s more (as always):
Distributed AI training requires careful setup of both hardware and software resources. In a UET-based system, the environment initialization proceeds through several key phases, each ensuring that GPUs, network interfaces, and processes are correctly configured before training begins:
1. Fabric Endpoint (FEP) Creation
Each GPU process is associated with a logical Fabric Endpoint (FEP) that abstracts the connection to its NIC port. FEPs, together with the connected switch ports, form a Fabric Plane (FP)—an isolated, high-performance data path. The NICs advertise their capabilities via LLDP messages to ensure compatibility and readiness.
2. Vendor UET Provider Publication
Once FEPs are created, they are published to the Vendor UET Provider, which exposes them as Libfabric domains. This step makes the Fabric Addresses (FAs) discoverable, but actual communication objects (endpoints, address vectors) are created later by the application processes. This abstraction ensures consistent interaction with the hardware regardless of vendor-specific implementations.
3. Job Launcher and Environment Variables
When a distributed training job is launched, the job launcher (e.g., Torchrun) sets up environment variables for each process. These include the master rank IP and port, local and global ranks, and the total number of processes.
4. Environment Variable Continue reading
My last post about Go got some attention.
In fact, two of my posts got attention that day, which broke my nginx since I was running livecount behind nginx, making me run out of file descriptors when thousands of people had the page opened.
It’s a shame that I had to turn off livecount, since it’d be cool to see the stats. But I was out of the country, with unreliable access to both Internet and even electricity in hotels, so I couldn’t implement the real fix until I got back, when it had already mostly died down.
I knew this was a problem with livecount, of course, and I even allude to it in its blog post.
Anyway, back to programming languages.
The reactions to my post can be summarized as:
I respect the first two. The last one has to be from people who are too emotionally invested with their tools, and take articles like this Continue reading
After the release of the Ubuntu 24.04 edition of Linux For Network Engineers (LFNE) I’ve got some questions from the community. Here are two new flavors of LFNE based on your requests. LFNE AlmaLinux 10 OS For Red Hat fans who prefer a RHEL-style environment. Since CentOS is no longer maintained, AlmaLinux is the closest […]
<p>The post Linux For Network Engineers (LFNE) – AlmaLinux & Alpine Editions first appeared on IPNET.</p>
Broadcom turned in its financial results for its third quarter last night, and all of the tongues in the IT sector are wagging about how the chip maker and enterprise software giant has landed a fourth customer for its burgeoning custom XPU design and shepherding business. …
Broadcom Lands Shepherding Deal For OpenAI “Titan” XPU was written by Timothy Prickett Morgan at The Next Platform.
Returning to a thread here at the Hedge, Rick Graziani joins Tom and Russ to discuss a college professor’s perspective on why network engineers should learn the theory, and not just the configuration.
Sales of GPU-accelerated servers are still hurting margins at Hewlett Packard Enterprise, as they are doing at all OEMs and probably the ODMs, too, but the good news is that they will be hurting less and less as sales of beefier and more profitable general purpose servers are on the rise and as sovereign clouds and neoclouds turn to HPE for iron and pay higher unit prices for gear. …
HPE Systems Rebound As Juniper Brings A Further Boost was written by Timothy Prickett Morgan at The Next Platform.
When Kubernetes workloads need to connect to the outside world, whether to access external APIs, integrate with external systems, or connect to partner networks, they often face a unique challenge. The problem? Pod IP addresses inside Kubernetes clusters are dynamic and non-routable. For external systems to recognize and trust this traffic, workloads need a consistent, dependable identity. This means outbound connections require fixed, routable IP addresses that external services can rely on. This is where Network Address Translation (NAT) becomes essential. It assigns Kubernetes pods with a static, consistent IP for all outbound traffic, ensuring those connections work properly.
If you’re running Kubernetes in the cloud, a common solution is to use your cloud provider’s managed NAT gateway service. These are easy to use, but they can come at a cost. In AWS, Azure, and Google Cloud, cloud-managed NAT gateways charge both an hourly fee and a per-gigabyte data processing fee. For high-traffic deployments, those charges can quickly add up, sometimes even exceeding your compute costs.
The good news: with Calico, you can handle NAT from inside your Kubernetes cluster, avoiding cloud NAT gateway fees and giving you more control over how egress Continue reading
It has taken nearly two decades and an immense amount of work by millions of people for high performance computing to go mainstream with GenAI. …
Why Is Japan Still Investing In Custom Floating Point Accelerators? was written by Timothy Prickett Morgan at The Next Platform.
Over the past few days Cloudflare has been notified through our vulnerability disclosure program and the certificate transparency mailing list that unauthorized certificates were issued by Fina CA for 1.1.1.1, one of the IP addresses used by our public DNS resolver service. From February 2024 to August 2025, Fina CA issued twelve certificates for 1.1.1.1 without our permission. We did not observe unauthorized issuance for any properties managed by Cloudflare other than 1.1.1.1.
We have no evidence that bad actors took advantage of this error. To impersonate Cloudflare's public DNS resolver 1.1.1.1, an attacker would not only require an unauthorized certificate and its corresponding private key, but attacked users would also need to trust the Fina CA. Furthermore, traffic between the client and 1.1.1.1 would have to be intercepted.
While this unauthorized issuance is an unacceptable lapse in security by Fina CA, we should have caught and responded to it earlier. After speaking with Fina CA, it appears that they issued these certificates for the purposes of internal testing. However, no CA should be issuing certificates for domains and IP addresses without checking control. At Continue reading
The Linux For Network Engineers (LFNE) Docker container has been refreshed and is now built on Ubuntu 24.04 LTS. For those new to it, LFNE is a ready-to-use Linux environment preloaded with the most popular tools used by network engineers—from packet capture and traffic analysis utilities to configuration helpers and scripting support. Instead of spending […]
<p>The post Linux For Network Engineers (LFNE) – Now on Ubuntu 24.04 first appeared on IPNET.</p>
TL&DR: Over 3000
A few weeks ago, Christian opened an issue describing how netlab breaks when the lab topology has more than 250 devices. We fixed that, only to get into another morass: some code has complexity higher than O(n) (meaning that going from 100 to 200 devices makes things more than twice as slow). Christian is working on one of those problems at the moment (it’s not that his ginormous labs won’t start, it just takes a long time), and I decided it’s time to polish a few other bits of the code.
There is no question that one of the smartest things that chip designer, packager, and manufacturing process manager Marvell Technology did was to shell out $650 million in May 2019 to buy Avera Semiconductor. …
Marvell’s Custom XPU Pipeline Is A Declaration Of AI Independence was written by Timothy Prickett Morgan at The Next Platform.