Although BGP supports traditional flow-based Layer 3 Equal-Cost Multi-Path (ECMP) load balancing, that method is not the best fit for a RoCEv2-based AI backend network. This is because GPU-to-GPU communication creates massive elephant flows, which RDMA-capable NICs transmit at line rate. These flows can easily cause congestion in the backend network.
In ECMP, all packets of a single flow follow the same path. If that path becomes congested, ECMP does not adapt or reroute traffic. This leads to uneven bandwidth usage across the network. Some links become overloaded, while others remain idle. In AI workloads, where multiple high-bandwidth flows occur at the same time, this imbalance can degrade performance.
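The pinning behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash (SHA-256 over the 5-tuple), the function names `ecmp_next_hop` and `link_load`, the addresses, and the 400 Gbps flow rates are all hypothetical and do not reflect any vendor's actual ECMP implementation. The only fixed fact is that RoCEv2 uses UDP (protocol 17) with destination port 4791.

```python
import hashlib


def ecmp_next_hop(flow, num_paths):
    """Map a 5-tuple to an uplink index via a hash (illustrative, not a vendor algorithm)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def link_load(flows, num_paths):
    """Sum the offered load per uplink when every packet of a flow follows its hashed path."""
    load = {path: 0 for path in range(num_paths)}
    for flow, gbps in flows:
        load[ecmp_next_hop(flow, num_paths)] += gbps
    return load


# Hypothetical elephant flows: (src_ip, dst_ip, protocol, src_port, dst_port), rate in Gbps.
flows = [
    (("10.0.0.1", "10.0.1.1", 17, 49152, 4791), 400),
    (("10.0.0.2", "10.0.1.2", 17, 49153, 4791), 400),
    (("10.0.0.3", "10.0.1.3", 17, 49154, 4791), 400),
    (("10.0.0.4", "10.0.1.4", 17, 49155, 4791), 400),
]

# The hash ignores link utilization, so nothing prevents two or more
# elephant flows from landing on the same uplink while another stays idle.
print(link_load(flows, num_paths=4))
```

Because the path choice depends only on the packet headers, a flow stays on its uplink no matter how loaded that uplink becomes; with only a handful of large flows, hash collisions are enough to leave the links unevenly used.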
Deep learning models rely heavily on collective operations like all-reduce, all-gather, and broadcast. These generate dense traffic patterns between GPUs, often at terabit-per-second speeds. If these flows are not evenly distributed, a single congested path can slow down the entire training job.
This chapter introduces two alternatives to traditional flow-based Layer 3 ECMP load balancing: 1) Flowlet-Based Load Balancing with Adaptive Routing, and 2) Packet-Based Load Balancing with Packet Spraying. Both aim to improve traffic distribution in RoCEv2-based AI backend networks, where conventional flow-based routing often …
The AI boom has been very, very good to Taiwan Semiconductor Manufacturing Co, which is positioned to do well if Nvidia continues with its hegemony over AI training and inference or if the rebel alliance forms behind AMD or if the hyperscalers and cloud builders dedicate a substantial portion of their capital budgets to etching and packaging homegrown compute engines. …
TSMC: The Second Most Profitable Company In The AI Revolution was written by Timothy Prickett Morgan at The Next Platform.
Hello my friend,
So far we’ve covered all the means of interacting with network devices that are meaningful in our opinion: SSH, NETCONF/YANG, and GNMI/YANG. There is one more protocol for managing network devices, RESTCONF, which is an application of REST API to network devices. In our experience, its support across network vendors is very limited; therefore, we don’t cover it. However, REST API itself is immensely important, as it is still the most widely used way for applications to talk to each other. And this is the focus of today’s blog.
Generative AI, agentic AI, and all other kinds of AI are absolutely useful things. The advancements there are very quick, and we ourselves are using them in our projects. At the same time, if you don’t know how to code or how to solve algorithmic tasks, how can you judge whether the solution provided by AI is correct? Whether it is optimal? And moreover, when it breaks, because every software breaks sooner or later, how can you fix it? That’s why we believe it is absolutely important to learn software development, tools, and algorithms. Perhaps, more …
As part of the pre-briefings ahead of the Google Cloud Next 2025 conference last week and then during the keynote address, the top brass at Google kept comparing a pod of “Ironwood” TPU v7p systems to the “El Capitan” supercomputer at Lawrence Livermore National Laboratory. …
Stacking Up Google’s “Ironwood” TPU Pod To Other AI Supercomputers was written by Timothy Prickett Morgan at The Next Platform.
I apologize for the rant; I have to vent my frustration with people whose quantity of opinions seems to be exceeding their experience (or maybe they’re coming from an alternate universe with different laws of physics, which would be way cool but also unlikely). You’ve been warned; please feel free to move on or skip the rant part of the blog post.
Rant mode: ON
This is the (unedited) gem I received after making some of my EVPN videos public:
Making a graphics card for gamers is one thing, but manufacturing a rackscale supercomputer with over 600,000 components that burns 120 kilowatts of power, that has over 5,000 copper cables for an all-to-all interconnect mesh for 72 dual-chip compute engines, and that weighs over 3,000 pounds is another thing entirely. …
Nvidia Sacrifices Profits To Preserve Revenues In The US was written by Timothy Prickett Morgan at The Next Platform.
When I was updating the Network Migration with BGP Local-AS Feature blog post, I wanted to execute the same command (show ip bgp) on all routers in my network.
Not a problem: since Dan Partelly added the netlab exec command, it’s as simple as netlab exec * show ip bgp. Well, not exactly; there are still a few quirks.
Cloudflare is not just another technology company. It’s a mission-driven force, committed to helping build a better Internet; one that is faster, safer, and more resilient. That mission is more critical than ever as organizations worldwide navigate an increasingly complex digital landscape, rife with cyber threats, regulatory challenges, and the need for scalable, cost-effective solutions.
In EMEA, that mission has special significance. The region is a patchwork of diverse markets, industries, and regulatory environments. It demands a partner-centric approach, one that empowers businesses of all sizes to harness Cloudflare’s comprehensive connectivity cloud platform to protect, connect, and accelerate their operations. That’s why I joined Cloudflare as VP of EMEA Partnerships.
Every great company has an inflection point, a moment when the market, the strategy, and the execution align to create unstoppable momentum. Cloudflare is at that moment now.
With record revenue growth, increasing traction among large customers, and an expanding suite of Zero Trust, AI, and network security solutions, Cloudflare is emerging as the partner of choice for enterprises and service providers across EMEA.
But what excites me most is the people, the opportunity to build a team in EMEA that is world-class in its expertise, …