IBM Will Catch A Piece Of The GenAI Wave With Next-Gen Systems

We don’t normally spend a lot of time writing about IBM mainframes, but these big iron systems drive a lot of transactions in the world – transactions flush with demographics and context that will feed into AI models – and they will soon be doing native and integrated AI processing for the applications that drive those transactions.

IBM Will Catch A Piece Of The GenAI Wave With Next-Gen Systems was written by Timothy Prickett Morgan at The Next Platform.

Recap: KubeCon + CloudNativeCon Europe 2025

When I got the assignment to attend KubeCon on the 1st of April, I thought it was an April Fools’ prank, but as the date got closer I realized this was for real: I’d be on the ground in London for the tenth anniversary of cloud native computing. I’ve seen a lot of tech events during my years in the industry (while trying not to get replaced by AI), and I have to say this one stands out!

Image source: CNCF YouTube Channel

Here is my recap of KubeCon + CloudNativeCon Europe 2025.

CalicoCon 2025

CalicoCon is an event that happens twice a year, co-located with KubeCon NA and EU. It’s a free event where you can learn about Tigera’s vision for the future of networking and security in the cloud. There’s also an after-party to celebrate our community and people like you who are on this journey with us!

This year our main focus was on Calico v3.30, our upcoming release that will add many anticipated features to Calico, unlocking things like observability, staged network policy, and the Gateway API. CalicoCon brought together cloud-native enthusiasts to explore the latest advancements in Calico and Kubernetes networking.
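For a sense of what "staged" means in practice: a staged policy is evaluated and reported on, but not enforced, so you can preview its impact before promoting it. The sketch below follows Calico's NetworkPolicy schema; verify the exact kind and fields against the v3.30 documentation before relying on it.

```
# Sketch of a staged Calico policy (field names assumed from the
# regular NetworkPolicy schema -- check the v3.30 docs).
apiVersion: projectcalico.org/v3
kind: StagedNetworkPolicy
metadata:
  name: restrict-db-ingress
  namespace: prod
spec:
  selector: app == 'db'
  types:
    - Ingress
  ingress:
    - action: Allow
      source:
        selector: app == 'backend'
```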

Continue reading

NAN090: Examining Network Automation’s Present and Future With Chris Grundemann

Today we explore the current state of network automation with Chris Grundemann, a co-founder of Network Automation Forum. Chris gives the history of the Network Automation Forum and AutoCon conference, which aims to blend technology and community. He shares ideas for advancing network automation, and invites other network engineers to participate in shaping the future... Read more »

Nvidia NeMo Microservices For AI Agents Hits The Market

Last year, amid all the talk of the “Blackwell” datacenter GPUs launched at the GPU Technology Conference, Nvidia also introduced the idea of Nvidia Inference Microservices, or NIMs, which are prepackaged enterprise-grade generative AI software stacks that companies can use as virtual copilots to add custom AI software to their own applications.

Nvidia NeMo Microservices For AI Agents Hits The Market was written by Jeffrey Burt at The Next Platform.

D2DO270: Spacelift Is Your Infrastructure Glue (Sponsored)

Working with multiple Infrastructure as Code (IaC) tools can be painful. Spacelift provides a platform that operates on top of disparate IaC tools, including Ansible, Kubernetes, Pulumi, Terraform, and OpenTofu. Spacelift helps build the automation with and between these tools, creating graphs of graphs that make your dependencies just work. On today’s show, we talk... Read more »

PP059: News Roundup – Oracle Plays Breach Word Games, Fast Flux Worries CISA, AI Package Hallucinations, and More

Once a month, the Packet Protector podcast likes to see what’s going on out there via our news roundup. There’s a lot happening! Today we discuss Fortinet warning that a threat actor has found a way to maintain read-only access on Fortinet devices even after you’ve applied the patch for the original vulnerability. Ivanti VPNs... Read more »

New year, no shutdowns: the Q1 2025 Internet disruption summary

Cloudflare’s network spans more than 330 cities in over 125 countries, where we interconnect with over 13,000 network providers in order to provide a broad range of services to millions of customers. The breadth of both our network and our customer base provides us with a unique perspective on Internet resilience, enabling us to observe the impact of Internet disruptions at both a local and national level, as well as at a network level.

As we have noted in the past, this post is intended as a summary overview of observed and confirmed disruptions, and is not an exhaustive or complete list of issues that have occurred during the quarter. A larger list of detected traffic anomalies is available in the Cloudflare Radar Outage Center. Note that both bytes-based and request-based traffic graphs are used within the post to illustrate the impact of the observed disruptions — the choice of metric was generally made based on which better illustrated the impact of the disruption.

In the first quarter of 2025, we observed a significant number of Internet disruptions due to cable damage and power outages. Severe storms caused outages in Ireland and Réunion, and an earthquake caused ongoing connectivity issues Continue reading

How ARP Killed a Static Route

The amount of weird stuff we discover in netlab integration tests is astounding, or maybe I have a knack for looking into the wrong dark corners (my wife would definitely agree with that). Today’s special: when having two next hops kills a static route.

TL&DR: default ARP settings on a multi-subnet Linux host are less than optimal.
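For context, the usual way to tighten Linux ARP behavior on multi-homed hosts is via sysctl. This is a generic hardening sketch, not necessarily the exact fix the article arrives at:

```
# /etc/sysctl.d/90-arp.conf -- tighten ARP on a multi-subnet host (sketch)
# arp_ignore=1: only answer ARP requests for addresses configured on the
#               interface the request arrived on.
# arp_announce=2: when sending ARP requests, use the best local source
#                 address for the target subnet, not just any local address.
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
```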

We use these principles when creating netlab integration tests:

  • They should contain a single device-under-test and a bunch of attached probes.
  • They should test a single feature.
  • They should not rely on the device-under-test. All validation has to be done on probes.

How do you test static routes under these restrictions? Here’s what we did:

Ultra Valuable uv for Dynamic, On-Demand Python Virtual Environments

What is UV? Astral's uv is a fast, all-in-one Python package and project manager written in Rust that unifies and accelerates Python development workflows by:

  • replacing multiple tools, including pip, pip-tools, poetry, pipx, pyenv, virtualenv, and twine
  • initializing a git repository
  • creating base files like .gitignore and pyproject.toml (think of this as requirements.txt READ MORE
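A typical uv workflow, sketched from the upstream CLI documentation (check `uv help` on your version, as subcommands evolve quickly):

```shell
# Create a new project: writes pyproject.toml and .gitignore, initializes git
uv init demo && cd demo

# Add a dependency: uv resolves, locks, and installs it into a managed venv
uv add requests

# Run a command inside the project's virtual environment
uv run python -c "import requests; print(requests.__version__)"
```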

The post Ultra Valuable uv for Dynamic, On-Demand Python Virtual Environments appeared first on The Gratuitous Arp.

AI Metrics with Prometheus and Grafana

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: total traffic entering the fabric
  • Operations: total RoCEv2 operations broken out by type
  • Core Link Traffic: histogram of load on fabric links
  • Edge Link Traffic: histogram of load on access ports
  • RDMA Operations: total RDMA operations
  • RDMA Bytes: average RDMA operation size
  • Credits: average number of credits in RoCEv2 acknowledgements
  • Period: detected period of compute / exchange activity on the fabric (in this case just over 0.5 seconds)
  • Congestion: total ECN / CNP congestion messages
  • Errors: total ingress / egress errors
  • Discards: total ingress / egress discards
  • Drop Reasons: packet drop reasons

This article gives step-by-step instructions to set up the dashboard in a production environment.

The easiest way to get started is to use Docker, see Deploy real-time network dashboards using Docker compose, and deploy the sflow/ai-metrics image bundling the AI Metrics application to generate metrics.

git clone https://github.com/sflow-rt/prometheus-grafana.git
cd prometheus-grafana
env RT_IMAGE=ai-metrics ./start.sh

scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: http
Continue reading
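Once Prometheus is scraping the endpoint above, a Grafana panel is just a PromQL query over the exported series. The metric name below is hypothetical — substitute one actually exposed by the AI Metrics application's /prometheus/txt endpoint:

```
# Hypothetical metric name -- check what the app actually exports.
# Bits per second entering the fabric over the last minute:
sum(rate(ifinoctets[1m]) * 8)
```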

Tech Bytes: Network Observability AIOps Tips For Success (Sponsored)

Today on the Tech Bytes podcast we’re talking AI readiness with sponsor Broadcom. More specifically, getting your network observability ready to support AI operations. This isn’t just a hardware or software issue. It’s also a data issue. We’ll get some tips with our guest Jeremy Rossbach. Jeremy is Chief Technical Evangelist and Lead Product Marketing... Read more »

AI for Network Engineers: Understanding Flow, Flowlet, and Packet-Based Load Balancing

Though BGP supports the traditional Flow-based Layer 3 Equal Cost Multi-Pathing (ECMP) traffic load balancing method, it is not the best fit for a RoCEv2-based AI backend network. This is because GPU-to-GPU communication creates massive elephant flows, which RDMA-capable NICs transmit at line rate. These flows can easily cause congestion in the backend network.

In ECMP, all packets of a single flow follow the same path. If that path becomes congested, ECMP does not adapt or reroute traffic. This leads to uneven bandwidth usage across the network. Some links become overloaded, while others remain idle. In AI workloads, where multiple high-bandwidth flows occur at the same time, this imbalance can degrade performance.
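The pinning behavior is easy to see in a toy model. This is a sketch of 5-tuple hashing, not any vendor's actual hash function:

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    """Toy flow-based ECMP: hash the 5-tuple to pick one of n_paths."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_paths

# Every packet of an elephant flow hashes to the same path, so one link
# carries the entire flow regardless of how congested it becomes:
choices = {ecmp_path("10.0.0.1", "10.0.1.1", 49152, 4791, 17, 8)
           for _ in range(1000)}
```

Only distinct flows land on distinct paths; a single flow never spreads, which is exactly the problem with elephant flows.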

Deep learning models rely heavily on collective operations like all-reduce, all-gather, and broadcast. These generate dense traffic patterns between GPUs, often at terabit-per-second speeds. If these flows are not evenly distributed, a single congested path can slow down the entire training job.

This chapter introduces two alternatives to traditional flow-based load balancing with Layer 3 ECMP: 1) Flowlet-Based Load Balancing with Adaptive Routing, and 2) Packet-Based Load Balancing with Packet Spraying. Both aim to improve traffic distribution in RoCEv2-based AI backend networks, where conventional flow-based routing often Continue reading
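A minimal sketch of the flowlet idea (illustrative only — real implementations live in switch ASICs and pick paths based on measured congestion, not round-robin):

```python
import itertools

class FlowletBalancer:
    """Flowlet switching sketch: a flow may change paths, but only after
    an idle gap longer than `gap` seconds, so packets within a burst
    stay ordered on a single path."""
    def __init__(self, n_paths, gap):
        self.gap = gap
        self.state = {}  # flow id -> (path, time of last packet)
        # Round-robin stands in for the adaptive, congestion-aware
        # path choice a real switch would make.
        self._next_path = itertools.cycle(range(n_paths))

    def path_for(self, flow, now):
        path, last = self.state.get(flow, (None, None))
        if path is None or now - last > self.gap:
            path = next(self._next_path)  # new flowlet: free to move
        self.state[flow] = (path, now)
        return path

lb = FlowletBalancer(n_paths=4, gap=0.5)
p1 = lb.path_for("gpu0->gpu9", 0.0)   # new flow: pick a path
p2 = lb.path_for("gpu0->gpu9", 0.1)   # burst continues: same path
p3 = lb.path_for("gpu0->gpu9", 1.0)   # idle gap exceeded: may switch
```

Because reordering can only happen across an idle gap, packets already in flight on the old path have drained before the flowlet moves, which is what makes this safer than per-packet spraying for RoCEv2.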

🔍 Debugging Web Apps with Cursor Just Got Smarter: Evaluating Browser Assist Tools

In my previous post, I shared my experience using Vibe coding and highlighted one of the biggest challenges in that workflow: AI coding tools often lack awareness of what’s happening in the browser when you run your app. This leads to a frustrating dev loop: you’re forced to constantly copy-paste screenshots, console errors, and network … Continue reading 🔍 Debugging Web Apps with Cursor Just Got Smarter: Evaluating Browser Assist Tools

TSMC: The Second Most Profitable Company In The AI Revolution

The AI boom has been very, very good to Taiwan Semiconductor Manufacturing Co, which is positioned to do well if Nvidia continues with its hegemony over AI training and inference or if the rebel alliance forms behind AMD or if the hyperscalers and cloud builders dedicate a substantial portion of their capital budgets to etching and packaging homegrown compute engines.

TSMC: The Second Most Profitable Company In The AI Revolution was written by Timothy Prickett Morgan at The Next Platform.
