If you’ve spent time supporting AI infrastructure, whether that’s a GPU training cluster, a fleet of inference nodes, or a multi-tenant model serving platform, you’ve probably noticed something: the network telemetry tools that served you well in a traditional data center feel slightly out of place here. Not useless. Just not quite designed for this.
The traffic patterns are different. The failure modes are different. The things you need to catch early are different. And if you’re running NetFlow or sFlow collection – which you should be – understanding where that data genuinely helps versus where you’re looking at the wrong instrument is the difference between a useful monitoring stack and a false sense of coverage.
Most of the networking intuition you’ve built over a career was forged on north-south traffic – clients reaching services, users reaching the internet, workloads reaching storage. Even in modern microservices environments with heavy east-west traffic, flows are relatively short-lived, heterogeneous in size, and largely TCP-based with normal congestion dynamics.
AI training breaks most of those assumptions simultaneously.
A distributed training job across a GPU cluster is synchronous in a way that most networked workloads are not. Every GPU in …
Commit Control is a core safety mechanism in the Noction Intelligent Routing Platform (IRP). It governs how routing changes are applied by enforcing bandwidth-related limits, ensuring that traffic shifts toward providers remain controlled and predictable. These limits are essential for protecting networks from sudden overloads and unintended traffic spikes.
Historically, Commit Control has relied on configured bandwidth assumptions. While this works well under stable conditions, real networks are rarely static. Physical interfaces may fail, bonded links can lose members, and available capacity may be reduced without immediate operational awareness. In such cases, Commit Control may continue to operate correctly from a configuration perspective, while the underlying physical capacity has already changed.
With IRP v4.3, we introduce Interface Monitoring, a feature that allows Commit Control to continuously align its decisions with the actual state and capacity of provider-facing interfaces.
Commit Control is designed to answer a critical question: Is it safe to commit a routing change that increases traffic toward this provider?
The answer depends not only on policy and configuration, but also on whether the provider connection can physically handle that traffic.
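To make that question concrete, here is a minimal sketch of the kind of check it implies. This is not IRP's implementation or API: the `BondMember` type, the `is_safe_to_commit` function, and the `headroom` factor are all hypothetical names chosen for illustration. The idea is simply that the usable limit is the smaller of the configured limit and what the interfaces that are actually up can carry.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class BondMember:
    """One physical member of a provider-facing link (hypothetical model)."""
    name: str
    speed_mbps: float
    is_up: bool


def effective_capacity_mbps(members: Iterable[BondMember]) -> float:
    """Sum the speed of the members that are currently up."""
    return sum(m.speed_mbps for m in members if m.is_up)


def is_safe_to_commit(
    current_mbps: float,
    added_mbps: float,
    configured_limit_mbps: float,
    members: Iterable[BondMember],
    headroom: float = 0.9,  # assumed safety margin, not an IRP setting
) -> bool:
    """Return True if the projected load fits both policy and physics.

    The usable limit is the lower of the configured limit and the
    physical capacity of the live members, scaled by the headroom.
    """
    physical_limit = effective_capacity_mbps(members) * headroom
    usable_limit = min(configured_limit_mbps, physical_limit)
    return current_mbps + added_mbps <= usable_limit


# Example: a 2 x 10G bond with a 15G configured limit.
healthy = [BondMember("eth0", 10_000, True), BondMember("eth1", 10_000, True)]
degraded = [BondMember("eth0", 10_000, True), BondMember("eth1", 10_000, False)]

ok_when_healthy = is_safe_to_commit(8_000, 4_000, 15_000, healthy)
ok_when_degraded = is_safe_to_commit(8_000, 4_000, 15_000, degraded)
```

In this sketch the same routing change is accepted while both members are up and rejected once one member fails, even though the configured limit never changed. That gap between configuration and physical state is what the question above is getting at.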
Physical failures are an unavoidable part of …
The post BGP Routing Information Base (RIB) Deep Dive appeared first on Noction.