Archive

Category Archives for "Networking"

Multi-vendor support for dropped packet notifications


The sFlow Dropped Packet Notification Structures extension was published in October 2020. Extending sFlow to provide visibility into dropped packets offers significant benefits for network troubleshooting, providing real-time, network-wide visibility into the specific packets that were dropped as well as the reason each packet was dropped. This visibility instantly reveals the root cause of drops and the impacted connections. Packet discard records complement sFlow's existing counter polling and packet sampling mechanisms and share a common data model so that all three sources of data can be correlated. For example, packet sampling reveals the top consumers of bandwidth on a link, helping to get to the root cause of congestion-related packet drops reported for the link.

Today the following network operating systems include support for the drop notification extension in their sFlow agent implementations:

Two additional sFlow dropped packet notification implementations are in the pipeline and should be available later this year:

CNCF and Synadia Reach an Agreement on NATS

Last month, Synadia, the primary maintainer of the NATS messaging system, tried to withdraw NATS from the open source governance of the Cloud Native Computing Foundation (CNCF). Its motive was to try to profit from NATS. Synadia had previously donated NATS to the CNCF in 2018. Now, the NATS project will continue in the CNCF’s cloud native open source ecosystem with Synadia’s continued support and involvement. A spokesperson for Synadia did not immediately respond to a TNS request for comment.

Not So Fast

Synadia had planned to regain control of the

Scaling with safety: Cloudflare’s approach to global service health metrics and software releases

Has your browsing experience ever been disrupted by this error page? Sometimes Cloudflare returns "Error 500" when our servers cannot respond to your web request. This inability to respond could have several potential causes, including problems caused by a bug in one of the services that make up Cloudflare's software stack.

We know that our testing platform will inevitably miss some software bugs, so we built guardrails to gradually and safely release new code before a feature reaches all users. Health Mediated Deployments (HMD) is Cloudflare’s data-driven solution to automating software updates across our global network. HMD works by querying Thanos, a system for storing and scaling Prometheus metrics. Prometheus collects detailed data about the performance of our services, and Thanos makes that data accessible across our distributed network. HMD uses these metrics to determine whether new code should continue to roll out, pause for further evaluation, or be automatically reverted to prevent widespread issues.
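The continue, pause, or revert decision described above can be illustrated with a minimal sketch. This is not Cloudflare's implementation: the function name `decide`, the `Action` enum, and both threshold values are hypothetical, and the error rate is assumed to have already been fetched from a metrics query.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    REVERT = "revert"


def decide(error_rate: float,
           pause_above: float = 1.0,
           revert_above: float = 5.0) -> Action:
    """Map an observed error rate (errors per second) to a rollout action.

    Both thresholds are hypothetical values chosen for illustration only.
    """
    if error_rate >= revert_above:
        return Action.REVERT    # clearly unhealthy: roll the release back
    if error_rate >= pause_above:
        return Action.PAUSE     # suspicious: hold the rollout for evaluation
    return Action.CONTINUE      # healthy: proceed to the next stage


if __name__ == "__main__":
    for rate in (0.2, 2.5, 9.0):
        print(rate, decide(rate).value)
```

In a real pipeline the thresholds would come from the service's own alerting rules or SLOs rather than fixed defaults.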

Cloudflare engineers configure signals from their service, such as alerting rules or Service Level Objectives (SLOs). For example, the following Service Level Indicator (SLI) checks the rate of HTTP 500 errors over 10 minutes returned from a service in our software stack.

sum(rate(http_request_count{code="500"}[10m]))

Continue reading

AI for Network Engineers: Rail Designs in GPU Fabric

 When building a scalable, resilient GPU network fabric, the design of the rail layer, the portion of the topology that interconnects GPU servers via Top-of-Rack (ToR) switches, plays a critical role. This section explores three different models: Multi-rail-per-switch, Dual-rail-per-switch, and Single-rail-per-switch. All three support dual-NIC-per-GPU designs, allowing each GPU to connect redundantly to two separate switches, thereby removing the Rail switch as a single point of failure.


Multi-Rail-per-Switch

In this model, multiple small subnets and VLANs are configured per switch, with each logical rail mapped to a subset of physical interfaces. For example, a single 48-port switch might host four or eight logical rails using distinct Layer 2 and Layer 3 domains. Because all logical rails share the same physical device, isolation is logical. As a result, a hardware or software failure in the switch can impact all rails and their associated GPUs, creating a large failure domain.
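As a rough illustration of mapping logical rails to subsets of physical interfaces, the sketch below splits a switch's ports evenly across rails. The function name `assign_rails` and the contiguous, 1-based port numbering are assumptions made for the example; a real design would also assign a VLAN and subnet to each rail.

```python
def assign_rails(num_ports: int, num_rails: int) -> dict[int, list[int]]:
    """Split switch ports 1..num_ports into equal, contiguous logical rails."""
    if num_rails <= 0 or num_ports % num_rails != 0:
        raise ValueError("ports must divide evenly across rails")
    per_rail = num_ports // num_rails
    return {
        rail: list(range(rail * per_rail + 1, (rail + 1) * per_rail + 1))
        for rail in range(num_rails)
    }


if __name__ == "__main__":
    # A 48-port switch hosting four logical rails of 12 ports each.
    for rail, ports in assign_rails(48, 4).items():
        print(f"rail {rail}: ports {ports[0]}-{ports[-1]}")
```

Every rail in the returned mapping lives on the same physical switch, which is why a single device failure takes all of them down together.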


This model is not part of NVIDIA’s validated Scalable Unit (SU) architecture but may suit test environments, development clusters, or small-scale GPU fabrics where hardware cost efficiency is a higher priority than strict fault isolation. From a CapEx perspective, multi-rail-per-switch is the most economical, requiring fewer switches. 


Figure 13-10 illustrates the Continue reading

From Python to Go 020. Concurrency and Parallelism Of Code Executions.

Hello my friend,

Today’s topic is critical to completing the full picture of software development for network automation. Today’s topic is what allows you to conduct your tasks within a meaningful time frame, especially when you have a lot of network devices, servers, and virtual machines to manage. Today’s topic is concurrency of code execution in Python and Golang.
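To give a feel for why concurrency matters when there are many devices to manage, here is a minimal Python sketch using the standard library's thread pool. The `poll_device` function and the hostnames are hypothetical, and the `time.sleep` call is only a stand-in for real network I/O; this is not the code from the original post.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_device(hostname: str) -> str:
    """Pretend to query one device; the sleep stands in for network latency."""
    time.sleep(0.1)
    return f"{hostname}: ok"


def poll_all(hostnames: list[str], max_workers: int = 8) -> list[str]:
    """Poll all devices concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(poll_device, hostnames))


if __name__ == "__main__":
    devices = [f"router-{i}" for i in range(5)]  # hypothetical hostnames
    started = time.perf_counter()
    for line in poll_all(devices):
        print(line)
    print(f"elapsed: {time.perf_counter() - started:.2f}s")
```

Because the simulated work is I/O-bound waiting, the threads overlap and the total run time stays close to a single poll rather than growing with the number of devices.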

What Other Programming Languages Make Sense To Study?

There are more than 100 programming languages out there. Some of them are quite universal and allow development of almost any kind of application. Others are more specific. Python is probably the most universal programming language, from what I’ve worked with or heard of. It can be used in infrastructure management at scale (e.g., OpenStack is written in Python), web applications, data science and many more. Golang is much more low-level compared to Python and, therefore, way more performant. Various benchmarks available online suggest that the same business tasks could be 3-30 times quicker in Golang compared to Python; therefore Golang is suitable for system programming (e.g., Kubernetes and Docker are created in Go). That’s what we cover in our blogs. Apart from them, there are a lot of others: C/C++ if you Continue reading

Gateway API or Ingress: A Developer’s Guide to Kubernetes Routing

Managing how applications within a Kubernetes cluster communicate with the outside world is a fundamental challenge. For years, the standard approach involved using multiple annotations to configure the Ingress resource, which made it confusing and cumbersome to manage. Recognizing these limitations, the Kubernetes community developed the Gateway API, a more powerful, flexible and standardized successor designed to streamline traffic management. The Ingress controller, a separate piece of software running in the Continue reading

Hedge 268: Will AI take our jobs?

One of the “great fears” advancing AI unlocks is that most of our jobs can, and will, be replaced by various forms of AI. Join us on this episode of the Hedge as Jonathan Mast at White Beard Strategies, Tom Ammon, and Russ White discuss whether we are likely to see a net loss, gain, or wash in jobs as companies deploy LLMs, and other potential up- and down-sides.
 


Thirteen new MCP servers from Cloudflare you can use today

You can now connect to Cloudflare's first publicly available remote Model Context Protocol (MCP) servers from Claude.ai (now supporting remote MCP connections!) and other MCP clients like Cursor, Windsurf, or our own AI Playground. Unlock Cloudflare tools, resources, and real time information through our new suite of MCP servers including: 

Server | Description
Cloudflare Documentation server | Get up to date reference information from Cloudflare Developer Documentation
Workers Bindings server | Build Workers applications with storage, AI, and compute primitives
Workers Observability server | Debug and get insight into your Workers application’s logs and analytics
Container server | Spin up a sandbox development environment
Browser rendering server | Fetch web pages, convert them to markdown and take screenshots
Radar server | Get global Internet traffic insights, trends, URL scans, and other utilities
Logpush server | Get quick summaries for Logpush job health
AI Gateway server | Search your logs, get details about the prompts and responses
AutoRAG server | List and search documents on your AutoRAGs
Audit Logs server | Query audit logs and generate reports for review
DNS Analytics server | Optimize DNS performance and debug issues based on current set up
Digital Experience Monitoring server | Get quick insight on critical applications for your organization
Cloudflare One CASB Continue reading

MCP Demo Day: How 10 leading AI companies built MCP servers on Cloudflare

Today, we're excited to collaborate with Anthropic, Asana, Atlassian, Block, Intercom, Linear, PayPal, Sentry, Stripe, and Webflow to bring a whole new set of remote MCP servers, all built on Cloudflare, to enable Claude users to manage projects, generate invoices, query databases, and even deploy full stack applications — without ever leaving the chat interface. 

Since Anthropic’s introduction of the Model Context Protocol (MCP) in November, there’s been more and more excitement about it, and it seems like a new MCP server is being released nearly every day. And for good reason!  MCP has been the missing piece to make AI agents a reality, and helped define how AI agents interact with tools to take actions and get additional context.

But to date, end-users have had to install MCP servers on their local machine to use them. Today, with Anthropic’s announcement of Integrations, you can access an MCP server the same way you would a website: type a URL and go.

At Cloudflare, we’ve been focused on building out the tooling that simplifies the development of remote MCP servers, so that our customers’ engineering teams can focus their time on building out the MCP tools for their Continue reading

Bringing streamable HTTP transport and Python language support to MCP servers

We’re continuing to make it easier for developers to bring their services into the AI ecosystem with the Model Context Protocol (MCP). Today, we’re announcing two new capabilities:

  • Streamable HTTP Transport: The Agents SDK now supports the new Streamable HTTP transport, allowing you to future-proof your MCP server. Our implementation allows your MCP server to simultaneously handle both the new Streamable HTTP transport and the existing SSE transport, maintaining backward compatibility with all remote MCP clients.

  • Deploy MCP servers written in Python: In 2024, we introduced first-class Python language support in Cloudflare Workers, and now you can build MCP servers on Cloudflare that are entirely written in Python.

Click “Deploy to Cloudflare” to get started with a remote MCP server that supports the new Streamable HTTP transport method, with backwards compatibility with the SSE transport. 

Streamable HTTP: A simpler way for AI agents to communicate with services via MCP

The MCP spec was updated on March 26 to introduce a new transport mechanism for remote MCP, called Streamable HTTP. The new transport simplifies how AI agents can interact with services by using a single HTTP endpoint for sending and receiving responses between the client and the Continue reading

OSPF Summary LSA Loop Prevention

A networking-focused entity known only as humblegrumble sent me the following question after reading my When OSPF Becomes a Distance Vector Protocol article:

How do A1 and A2 know not to advertise a Type-3 summary LSA generated from area 1 prefixes back into area 1?

He’s right. There is no “originating area” information in the type-3 LSA, so how does an ABR know not to reinsert the type-3 LSA generated by another ABR back into the area?

TL&DR: The OSPF route selection process takes care of that.
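A simplified sketch of that idea, under stated assumptions: OSPF prefers intra-area paths over inter-area paths regardless of cost, and an ABR does not originate a summary into the area its best path already belongs to. The `Route` class, the function names, and the prefix and cost values are all hypothetical; this models only the decision, not a real OSPF implementation.

```python
from dataclasses import dataclass

# Lower value wins. Intra-area paths always beat inter-area paths.
PREFERENCE = {"intra-area": 0, "inter-area": 1}


@dataclass(frozen=True)
class Route:
    prefix: str
    path_type: str  # "intra-area" or "inter-area"
    area: int       # area the path was computed in
    cost: int


def best_route(candidates: list[Route]) -> Route:
    """Pick the best path: path type first, then cost."""
    return min(candidates, key=lambda r: (PREFERENCE[r.path_type], r.cost))


def should_summarize_into(route: Route, target_area: int) -> bool:
    """An ABR skips the area that the selected path itself belongs to."""
    return route.area != target_area


if __name__ == "__main__":
    # A2's view of an area 1 prefix: its own intra-area path, plus the
    # type-3 LSA that A1 flooded into the backbone (area 0).
    candidates = [
        Route("10.1.1.0/24", "intra-area", area=1, cost=20),
        Route("10.1.1.0/24", "inter-area", area=0, cost=5),
    ]
    best = best_route(candidates)
    print(best.path_type, should_summarize_into(best, target_area=1))
```

Even though the inter-area path has the lower cost in this example, the intra-area path is selected, so the ABR has nothing to reinsert into area 1.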

PP060: Subsea Cables and the Watery Risks to Critical Infrastructure

Submarine cables are a hidden wonder. These fiber optic bundles carry data and voice traffic around the world and serve as critical global links for communication and commerce. Today on Packet Protector, guest Andy Champagne dives into the history of submarine cables, the technological and operational advancements that allow voice and data to travel hundreds... Read more »

HS102: IT’s Role In AI (Sponsored)

AI can impact an enterprise in several ways: making individuals more productive, making products and services more effective, and making it easier for customers and partners to do business. IT plays a critical role in enabling AI to have these impacts. On today’s sponsored Heavy Strategy, Cisco CIO Fletcher Previn explains how to locate AI use... Read more »