As explained in the preceding chapter, "Egress Interface Congestions," both the links from Rail switches to GPU servers and the inter-switch links can become congested during gradient synchronization. Congestion control mechanisms designed specifically for RDMA workloads are essential in AI fabric back-end networks: congestion slows down the learning process, and even a single packet loss can force a restart of the whole training process.
This section begins by introducing Explicit Congestion Notification (ECN) and Priority-based Flow Control (PFC), two foundational technologies used in modern lossless Ethernet networks. ECN allows switches to mark packets, rather than dropping them, when congestion is detected, enabling endpoints to react proactively. PFC, on the other hand, offers per-priority flow control, which can pause selected traffic classes while allowing others to continue flowing.
Finally, we describe how Datacenter Quantized Congestion Notification (DCQCN) combines ECN and PFC to deliver a scalable and lossless transport mechanism for RoCEv2 traffic in AI clusters.
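To make the ECN-to-endpoint feedback loop concrete, the following Python toy sketches DCQCN's sender-side rate control as described in the original DCQCN paper (Zhu et al., SIGCOMM 2015). The class, constants, and units are illustrative only, not taken from any NIC or switch implementation:

```python
class DcqcnSender:
    """Toy sketch of DCQCN sender-side rate control (after Zhu et al.,
    SIGCOMM 2015). Constants and units are illustrative only."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 256):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate remembered for recovery
        self.alpha = 1.0           # running estimate of congestion severity
        self.g = g                 # EWMA gain for alpha updates

    def on_cnp(self) -> None:
        # The receiver saw ECN-marked packets and returned a Congestion
        # Notification Packet: remember the current rate, then cut it.
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2

    def on_timer_no_cnp(self) -> None:
        # No CNP within the update period: decay alpha and climb back
        # toward the remembered target rate (fast recovery).
        self.alpha *= 1 - self.g
        self.rc = (self.rt + self.rc) / 2


sender = DcqcnSender(line_rate_gbps=100.0)
sender.on_cnp()             # first CNP halves the rate (alpha starts at 1)
sender.on_timer_no_cnp()    # quiet period: recover halfway back
print(round(sender.rc, 1))  # → 75.0
```

The key property DCQCN borrows from ECN is that the rate cut is proportional to how much marking the receiver has recently seen (alpha), while PFC remains a last-resort backstop against loss rather than the primary control loop.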
I am pleased to introduce the work of one of my students, who developed a […]
The post Professional Corporate Network Simulation in Packet Tracer first appeared on Brezular's Blog.
It is almost the end of Developer Week and we haven't talked about containers: until now. As some of you may know, we've been working on a container platform behind the scenes for some time.
In late June, we plan to release Containers in open beta, and today we'll give you a sneak peek at what makes it unique.
Workers are the simplest way to ship software around the world with little overhead. But sometimes you need to do more. You might want to:
Run user-generated code in any language
Execute a CLI tool that needs a full Linux environment
Use several gigabytes of memory or multiple CPU cores
Port an existing application from AWS, GCP, or Azure without a major rewrite
Cloudflare Containers let you do all of that while being simple, scalable, and global.
Through a deep integration with Workers and an architecture built on Durable Objects, Workers can be your:
API Gateway: Letting you control routing, authentication, caching, and rate-limiting before requests reach a container
Service Mesh: Creating private connections between containers with a programmable routing layer
Orchestrator: Allowing you to write custom scheduling, scaling, and health checking logic for your containers
Instead Continue reading
With quick access to flexible infrastructure and innovative AI tools, startups are able to deploy production-ready applications with speed and efficiency. Cloudflare plays a pivotal role for countless applications, empowering founders and engineering teams to build, scale, and accelerate their innovations with ease, and without the burden of technical overhead. And when applicable, initiatives like our Startup Program and Workers Launchpad offer the tooling and resources that further fuel these ambitious projects.
Cloudflare recently announced AI agents, allowing developers to leverage Cloudflare to deploy agents to complete autonomous tasks. We're already seeing some great examples of startups leveraging Cloudflare as their platform of choice to invest in building their agent infrastructure. Read on to see how a few up-and-coming startups are building their AI agent platforms, powered by Cloudflare.
Founded in 2023, Lamatic.ai empowers SaaS startups to seamlessly integrate intelligent AI agents into their products. Lamatic.ai simplifies the deployment of AI agents by offering a fully managed lifecycle with scalability and security in mind. SaaS providers have been leveraging Lamatic to replatform their AI workflows via a no-code visual builder to reduce technical debt Continue reading
Today, we're sharing a preview of a new feature that makes it easier to build cross-cloud apps: Workers VPC.
Workers VPC is our take on the traditional virtual private cloud (VPC), modernized for network and compute that isn't tied to a single cloud region. And we're complementing it with Workers VPC Private Links to make building across clouds easier. Together, they introduce two new capabilities to Workers:
A way to group your apps' resources on Cloudflare into isolated environments, where only resources within a Workers VPC can access one another, allowing you to secure and segment app-to-app traffic (a "Workers VPC").
A way to connect a Workers VPC to a legacy VPC in a public or private cloud, enabling your Cloudflare resources to access your resources in private networks and vice versa, as if they were in a single VPC (the "Workers VPC Private Link").
Workers VPC and Workers VPC Private Link enable bidirectional connectivity between Cloudflare and external clouds
When linked to an external VPC, Workers VPC makes the underlying resources directly addressable, so that application developers can think at the application layer, without dropping down to the network layer. Think of this like a Continue reading
During Cloudflare's Birthday Week in September 2024, we introduced a revamped Startup Program designed to make it easier for startups to adopt Cloudflare through a new credits system. This update focused on better aligning the program with how startups and developers actually consume Cloudflare, by providing them with clearer insight into their projected usage, especially as they approach graduation from the program.
Today, we're excited to announce an expansion to that program: new credit tiers that better match startups at every stage of their journey. But before we dive into what's new, let's take a quick look at what the Startup Program is and why it exists.
Cloudflare for Startups provides credits to help early-stage companies build the next big idea on our platform. Startups accepted into the program receive credits valid for one year or until they're fully used, whichever comes first.
Beyond credits, the program includes access to up to three domains with enterprise-level services, giving startups the same advanced tools we provide to large companies to protect and accelerate their most critical applications.
We know that building a startup is expensive, and Cloudflare is uniquely positioned to support the full-stack Continue reading
Cloudflare plays a significant role in supporting the Internet's infrastructure. As a reverse proxy used by approximately 20% of all websites, we sit directly in the request path between users and the origin, helping to improve performance, security, and reliability at scale. Beyond that, our global network powers services like content delivery, Workers, and R2, making Cloudflare not just a passive intermediary, but an active platform for delivering and hosting content across the Internet.
Since Cloudflare's launch in 2010, we have collaborated with the National Center for Missing and Exploited Children (NCMEC), a US-based clearinghouse for reporting child sexual abuse material (CSAM), and are committed to doing what we can to support the identification and removal of CSAM content.
Members of the public, customers, and trusted organizations can submit reports of abuse observed on Cloudflare's network. A minority of these reports relate to CSAM; those are triaged with the highest priority by Cloudflare's Trust & Safety team. We also forward details of the report, along with relevant files (where applicable) and supplemental information, to NCMEC.
The process to generate and submit reports to NCMEC involves multiple steps, dependencies, and error handling, which quickly became complex under Continue reading
When most people think of segment routing (SR), they think of SRv6: using IPv6 addresses as segment IDs, and carving up the least significant /64 to create micro-SIDs for service differentiation. This is not, however, the only way to implement and deploy SR. The alternative is SR using MPLS labels, or SR/MPLS. Hemant Sharma joins Tom Ammon and Russ White to discuss SR/MPLS, why operators might choose MPLS labels over IPv6 SIDs, and other topics related to SR/MPLS.
When it comes to artificial intelligence, context is everything. The same thing holds true for human intelligence, so it stands to reason that it translates to AI since we created it in our own image. …
The AI Factory: 12,000 Years In The Making, And Absolutely Inevitable was written by Timothy Prickett Morgan at The Next Platform.
Any public certification authority (CA) can issue a certificate for any website on the Internet to allow a webserver to authenticate itself to connecting clients. Take a moment to scroll through the list of trusted CAs for your web browser (e.g., Chrome). You may recognize (and even trust) some of the names on that list, but it should make you uncomfortable that any CA on that list could issue a certificate for any website, and your browser would trust it. It's a castle with 150 doors.
Certificate Transparency (CT) plays a vital role in the Web Public Key Infrastructure (WebPKI), the set of systems, policies, and procedures that help to establish trust on the Internet. CT ensures that all website certificates are publicly visible and auditable, helping to protect website operators from certificate mis-issuance by dishonest CAs, and helping honest CAs to detect key compromise and other failures.
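The auditability CT provides rests on an append-only Merkle tree. As a rough illustration, here is the RFC 6962 hashing scheme in Python, simplified to power-of-two tree sizes (the RFC also defines the split rule for arbitrary sizes):

```python
import hashlib

def leaf_hash(entry: bytes) -> bytes:
    # RFC 6962 domain-separates leaves with a 0x00 prefix...
    return hashlib.sha256(b"\x00" + entry).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    # ...and interior nodes with 0x01, so a leaf hash can never be
    # passed off as an internal node (or vice versa).
    return hashlib.sha256(b"\x01" + left + right).digest()

def tree_head(entries: list[bytes]) -> bytes:
    """Merkle Tree Hash, simplified to power-of-two input sizes; RFC
    6962 additionally defines how to split trees of arbitrary size."""
    level = [leaf_hash(e) for e in entries]
    while len(level) > 1:
        level = [node_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

Because any change to a logged certificate changes the tree head, a log that has signed one head cannot silently rewrite history, which is what lets monitors detect mis-issuance after the fact.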
In this post, we'll discuss the history, evolution, and future of the CT ecosystem. We'll cover some of the challenges we and others have faced in operating CT logs, and how the new static CT API log design lowers the bar for operators, helping to ensure that Continue reading
Since the launch of Workers AI in September 2023, our mission has been to make inference accessible to everyone.
Over the last few quarters, our Workers AI team has been heads down on improving the quality of our platform, working on various routing improvements, GPU optimizations, and capacity management improvements. Managing a distributed inference platform is not a simple task, but distributed systems are also what we do best. You'll notice a recurring theme from all these announcements that has always been part of the core Cloudflare ethos: we try to solve problems through clever engineering so that we are able to do more with less.
Today, we're excited to introduce speculative decoding to bring you faster inference, an asynchronous batch API for large workloads, and expanded LoRA support for more customized responses. Lastly, we'll be recapping some of our newly added models, updated pricing, and unveiling a new dashboard to round out the usability of the platform.
We're excited to roll out speed improvements to models in our catalog, starting with the Llama 3.3 70b model. These improvements include speculative decoding, prefix caching, an updated inference backend, Continue reading
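To make the idea behind speculative decoding concrete, here is a toy Python sketch of the greedy-verification variant: a cheap draft model proposes a few tokens and the large target model keeps only the prefix it agrees with. The lambda "models" are stand-ins, not Workers AI internals, and a real implementation verifies all k draft tokens in a single batched forward pass rather than one at a time:

```python
def speculative_step(prefix, draft, target, k=4):
    """One round: the cheap draft model proposes k tokens; the target
    model keeps the prefix it agrees with, plus one verified token."""
    proposed, p = [], list(prefix)
    for _ in range(k):
        t = draft(p)
        proposed.append(t)
        p.append(t)
    accepted, p = [], list(prefix)
    for t in proposed:
        if target(p) != t:      # first disagreement ends the free ride
            break
        accepted.append(t)      # target agrees: this token came cheap
        p.append(t)
    accepted.append(target(p))  # always gain one token from the target
    return list(prefix) + accepted


target = lambda p: (p[-1] + 1) % 10                      # toy model: counts up
draft = lambda p: (p[-1] + 1) % 10 if p[-1] != 3 else 9  # goes wrong after 3
print(speculative_step([0], draft, target))  # → [0, 1, 2, 3, 4]
```

The speedup comes from the fact that output is unchanged (the target model still decides every token) while several tokens per target-model pass are produced whenever the draft model guesses well.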
The HPC centers of the world like fast networks and compute, but they are also always working under budget constraints unlike their AI peers out there in the enterprise, where money seems to be unlimited to what sometimes looks like an irrationally exuberant extent. …
Google Woos HPC Centers With Fast CPUs And Networks was written by Timothy Prickett Morgan at The Next Platform.
The availability of Cisco IOS XR Release 25.1.1 brings sFlow dropped-packet notification support to Cisco 8000 series routers, making it easy to capture and analyze packets dropped at router ingress. This helps operators understand which traffic is being blocked, identify potential security threats, and optimize network performance.
sFlow Configuration for Traffic Monitoring and Analysis describes the steps to enable sFlow and configure packet sampling and interface counter export from a Cisco 8000 Series router to a remote sFlow analyzer.
Note: Devices using NetFlow or IPFIX must transition to sFlow for regular sampling before utilizing the dropped packet feature, ensuring compatibility and consistency in data analysis.
Router(config)#monitor-session monitor1
Router(config)#destination sflow EXP-MAP
Router(config)#forward-drops rx
Configure a monitor-session with the new destination sflow option to export dropped packet notifications (which include ingress interface, drop reason, and header of dropped packet) to the configured sFlow analyzer.
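On the collector side, the analyzer receives these notifications as sFlow version 5 datagrams over UDP. As a rough sketch of the first step a collector performs, here is a minimal parser for the v5 datagram header in Python; the field layout follows the sFlow.org version 5 specification, and a real collector would go on to decode the flow and counter sample records that follow:

```python
import struct

def parse_sflow_header(data: bytes) -> dict:
    """Parse the fixed sFlow v5 datagram header (per sFlow.org v5 spec)."""
    version, addr_type = struct.unpack_from("!II", data, 0)
    offset = 8
    if addr_type == 1:                       # IPv4 agent address
        agent = ".".join(str(b) for b in data[offset:offset + 4])
        offset += 4
    else:                                    # IPv6 agent address (type 2)
        agent = data[offset:offset + 16].hex()
        offset += 16
    sub_agent, seq, uptime_ms, nsamples = struct.unpack_from("!IIII", data, offset)
    return {"version": version, "agent": agent, "sub_agent": sub_agent,
            "sequence": seq, "uptime_ms": uptime_ms, "samples": nsamples}


# A synthetic datagram: version 5, IPv4 agent 10.0.0.1, 2 samples.
pkt = (struct.pack("!II", 5, 1) + bytes([10, 0, 0, 1])
       + struct.pack("!IIII", 0, 42, 1000, 2))
print(parse_sflow_header(pkt)["agent"])  # → 10.0.0.1
```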
Cisco lists several benefits of streaming dropped packets in the configuration guide.
Super Slurper is Cloudflare's data migration tool, designed to make large-scale data transfers between cloud object storage providers and Cloudflare R2 easy. Since its launch, thousands of developers have used Super Slurper to move petabytes of data from AWS S3, Google Cloud Storage, and other S3-compatible services to R2.
But we saw an opportunity to make it even faster. We rearchitected Super Slurper from the ground up using our Developer Platform (building on Cloudflare Workers, Durable Objects, and Queues) and improved transfer speeds by up to 5x. In this post, we'll dive into the original architecture, the performance bottlenecks we identified, how we solved them, and the real-world impact of these improvements.
Super Slurper originally shared its architecture with SourcingKit, a tool built to bulk import images from AWS S3 into Cloudflare Images. SourcingKit was deployed on Kubernetes and ran alongside the Images service. When we started building Super Slurper, we split it into its own Kubernetes namespace and introduced a few new APIs to make it easier to use for the object storage use case. This setup worked well and helped thousands of developers move data to Continue reading
Today, we're launching the open beta of Pipelines, our streaming ingestion product. Pipelines allows you to ingest high volumes of structured, real-time data and load it into our object storage service, R2. You don't have to manage any of the underlying infrastructure, worry about scaling shards or metadata services, and you pay for the data processed (not by the hour). Anyone on a Workers paid plan can start using it to ingest and batch data, at tens of thousands of requests per second (RPS), directly into R2.
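Batching is the core pattern behind this kind of ingestion service: records accumulate until a size or age threshold is hit, then flush as one object write. The Python sketch below shows the generic pattern only; the thresholds, the flush callback, and the class itself are illustrative, not Pipelines internals (a real service also flushes on a background timer, not only when a write arrives):

```python
import time

class Batcher:
    """Illustrative size/age batching ahead of object-storage writes."""

    def __init__(self, flush, max_records=1000, max_age_s=5.0):
        self.flush = flush              # callback that persists one batch
        self.max_records = max_records
        self.max_age_s = max_age_s
        self.buf = []
        self.opened = 0.0

    def add(self, record):
        if not self.buf:
            self.opened = time.monotonic()   # batch starts aging now
        self.buf.append(record)
        if (len(self.buf) >= self.max_records
                or time.monotonic() - self.opened >= self.max_age_s):
            batch, self.buf = self.buf, []
            self.flush(batch)


batches = []
b = Batcher(batches.append, max_records=3)
for r in range(7):
    b.add(r)
print(batches)  # → [[0, 1, 2], [3, 4, 5]]  (record 6 is still buffered)
```

The size threshold keeps objects large enough to query efficiently, while the age threshold bounds how stale buffered data can get; tuning the two against each other is the main design trade-off.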
But this is just the tip of the iceberg: you often want to transform the data you're ingesting, hydrate it on the fly from other sources, and write it to an open table format (such as Apache Iceberg), so that you can efficiently query that data once you've landed it in object storage.
The good news is that we've thought about that too, and we're excited to announce that we've acquired Arroyo, a cloud-native, distributed stream processing engine, to make that happen.
With Arroyo and our just-announced R2 Data Catalog, we're getting increasingly serious about building a data platform that allows you to ingest data across the planet, store Continue reading